The concept playing a central role in the theory which will be
described is the notion that the ensemble of points in signal space which
represents a set of nonidentical events belonging to a common category
must be close to each other as measured by some as yet unknown method of
measuring distance, since the points represent events which are close to
each other in the sense that they are members of the same category.
Mathematically speaking, the fundamental notion underlying the theory is
that similarity (closeness in the sense of belonging to the same class or
category) is expressible by a metric (a method of measuring distance) by
which points representing examples of the category we wish to recognize
are found to lie close to each other.

To give credence to this conjecture, consider what we mean by
the abstract concept of a class. According to one of the possible
definitions, a class is a collection of things which have some common
properties. By a modification of this thought, a class could be
characterized by the common properties of its members. A metric by which
points representing examples of a class are close to each other must
therefore operate chiefly on the common properties of the examples and must
ignore, to a large extent, those properties not present in each example.
As a consequence of this argument, if a metric were found which called
examples of the class close, somehow it must exhibit their common
properties.

To present this fundamental idea in a slightly different way, we
can state that a transformation on the signal space which is capable of
clustering the points representing the examples of the class must operate
primarily on the common properties of the examples. A simple illustration
of this idea is shown in Figure 1, where the ensemble of points is spread
out in signal space (only a two-dimensional space is shown for ease of
illustration) but a transformation T of the space is able to cluster the
points of the ensemble.




Figure  1.  Clustering  by  Transformation 

In the above example neither the signal's property represented by
coordinate 1 nor that represented by coordinate 2 is sufficient to describe
the class, for the spread in both is large over the ensemble of points.
Some function of the two coordinates, on the other hand, would exhibit the
common property that the ratio of the value of coordinate 2 to that of
coordinate 1 in each point in the ensemble is nearly unity. In this
specific instance, of course, simple correlation between the two
coordinates would exhibit this property, but in more general situations
simple correlation will not suffice.
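The point of the example can be checked numerically. The following is a minimal sketch, not taken from the report: the ensemble is synthetic, and the names `points`, `spread_c1`, `spread_c2`, and `spread_ratio` are assumptions made for illustration. It shows that each coordinate alone has a large spread while the ratio of the two is tightly clustered.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical ensemble: coordinate 1 varies widely, and coordinate 2 follows
# it to within about 5 percent, so the ratio of the two is nearly unity.
points = []
for _ in range(200):
    c1 = random.uniform(1.0, 10.0)
    c2 = c1 * random.uniform(0.95, 1.05)
    points.append((c1, c2))

# Spread (population standard deviation) of each coordinate taken alone.
spread_c1 = statistics.pstdev([c1 for c1, _ in points])
spread_c2 = statistics.pstdev([c2 for _, c2 in points])

# Spread of the ratio coordinate 2 / coordinate 1 over the same ensemble.
spread_ratio = statistics.pstdev([c2 / c1 for c1, c2 in points])

print(f"spread of coordinate 1: {spread_c1:.3f}")
print(f"spread of coordinate 2: {spread_c2:.3f}")
print(f"spread of the ratio:    {spread_ratio:.3f}")
```

The ratio plays the role of the common property here; a transformation that maps each point to its ratio clusters the ensemble even though neither original coordinate does.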

If the signal space shown in Figure 1 were flexible (as if made of
a rubber sheet), the transformation T would express the manner in which
various portions of the space must be stretched or compressed in order to
bring the points together most closely. Although thinking of
transformations of the space is not as general as thinking about exotic
ways of measuring "distance" in the original space, the former is a
rigorously correct and easily visualized analogy for many important
classes of metrics.

Mathematical techniques have been developed to automatically find
the "best" metric or "best" transformation of given classes of metrics
according to suitable criteria which establish "best".

As any mathematical theory, the one which evolved from the
preceding ideas is based on certain assumptions. The most basic assumption
is that the N-dimensional signal space representation of events
exemplifying their respective classes is complete enough to contain
information about the common properties which serve to characterize the
classes. The significance of this assumption is appreciated if we
consider, for example, that the signal space contains all the information
that a black and white television picture could present of the physical
objects making up the sequence of events which constitute the examples of
a class. No matter how ingenious the data processing schemes that we might
evolve are, objects belonging to the category "red things" could not be
identified, because representation of the examples by black and white
television simply does not contain color information. For any practical
situation one must rely on engineering judgment and intuition to determine
if the model of the real world (the signal space) is complete enough.
Fortunately, in most cases, this determination may be made with
considerable confidence.

A second assumption states the class of transformations or the
class of metrics within which we look for the "best". This assumption

2. A SPECIAL THEORY OF SIMILARITY

2.1 Similarity

The central problem of pattern recognition is viewed in this
work as the problem of developing a function of a point and a set of
points in an N-dimensional space to partition the space into a number of
regions corresponding to the categories to which the known set of points
belong. A convenient, though not essential, way of thinking about
this partitioning function is to consider it formed from a set of
functions, one for each category, where each function measures the
"likelihood"* with which an arbitrary point of the space could best fit
into the particular function's own category. In a sense, each function
measures the similarity of an arbitrary point of the space to a category,
and the partitioning function assigns the arbitrary point to that category
to which the point is most similar.

The foregoing concept of partitioning the signal space is
illustrated in Figure 2, where the signal space has two dimensions and the
space is to be partitioned into two categories. In Figure 2a, the height
of the surface above the x-y plane expresses the likelihood that a point
belongs to Category 1, while that of the surface in Figure 2b expresses
the likelihood that the point belongs to Category 2. The intersection
between the two surfaces, shown in Figure 3a and b, marks the boundary
between Region 1, where points are more likely to belong to Category 1
than to Category 2, and Region 2, where the reverse is true.

* Although the term "likelihood" has an already well-defined meaning in
decision theory, it is used here in a qualitative way to emphasize the
similarity between fundamental ideas in decision theory and in the
theory which is here described.

a) "Likelihood" of Membership in Category 1

b) "Likelihood" of Membership in Category 2

Figure 2. Likelihood of Membership in Two Categories

For each category of interest a set of likelihood ratios may
be computed which express the relative likelihood that a point in question
belongs to the category of interest rather than to any of the others.
From the maximum of all likelihood ratios which correspond to a given
point, we may infer to which category the point most likely belongs.
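The partitioning rule can be sketched in a few lines. This is an illustrative sketch only: the bell-shaped surfaces stand in for the "likelihood" surfaces of Figure 2, and the names `bump`, `likelihoods`, and `categorize` are hypothetical. For two categories, choosing the larger likelihood is the same as choosing the category whose likelihood ratio against the other exceeds unity.

```python
import math

def bump(center, width):
    """Return a bell-shaped surface over the x-y plane, peaked at `center`."""
    cx, cy = center

    def surface(x, y):
        squared_distance = (x - cx) ** 2 + (y - cy) ** 2
        return math.exp(-squared_distance / (2.0 * width ** 2))

    return surface

# Hypothetical "likelihood" of membership for each of two categories.
likelihoods = {
    1: bump((0.0, 0.0), 1.0),
    2: bump((3.0, 0.0), 1.0),
}

def categorize(x, y):
    """Assign the point (x, y) to the category with the greatest likelihood."""
    return max(likelihoods, key=lambda category: likelihoods[category](x, y))

print(categorize(0.5, 0.0))  # a point near the peak of Category 1
print(categorize(2.5, 1.0))  # a point near the peak of Category 2
```

With these two equal-width surfaces the boundary between the regions is the line x = 1.5, where the two surfaces intersect.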

The reader will recognize the idea of making decisions based on
the maximum likelihood ratio as one of the important concepts of decision
theory. The objective of the preceding discourse is, therefore, simply
to make the statement that once a function measuring the likelihood that
a point belongs to a given category is developed, there is at least one
well-established precedent for partitioning signal space into regions
which are associated with the different categories. The resulting regions
are like a template which serves to categorize points depending upon
whether they are covered or are left uncovered by the template. Although
in the rest of this chapter partitioning the signal space is based on a
measure of similarity which resembles the likelihood ratio only in the
manner in which it is used, it is shown elsewhere that, in certain cases,
decisions based on the measure of similarity are identical to those based
on the maximum likelihood ratio.

One might wonder whether the error criterion by which similarity to
a class of things is measured should be based on known members of the class
only, or also on the additional knowledge gained from a set of things which
do not belong to the class. The philosophical question posed by these two
possibilities is whether one is aided in learning to recognize membership
in a category if, during the period of learning, examples of nonmembers of
the category are also given. It seems plausible that increasing the
knowledge available on members and nonmembers of the category may better
the separation between categories. There are significant categories,
however, where knowledge of nonmembers does not help to determine what
constitutes membership in the category. The analogous situation in
decision theory is pointed out later.

In the first three chapters of this report, a quantitative measure
of similarity is developed in a special theory where similarity is
considered as a property of only the point to be compared and the set of
points which belong to the category to be learned. In later chapters,
however, methods will be discussed for letting known nonmembers of the
class influence the development of measures of similarity.

In the special theory of the first three chapters, similarity of an
event P to a category is measured by the closeness of P to every one of
those events known to be contained in the category. Similarity S
is regarded as the average "distance" between P and the class of events
represented by the set of its examples.

Two things should be noted about the foregoing definition of
similarity. One is that the method of measuring distance does not
influence the definition. Indeed, distance is not meant here in the
ordinary Euclidean sense; it may mean "closeness" in some arbitrary,
abstract property of the set $\{F_m\}$ which has yet to be determined. The
second thing to note is that the concept of distance between points, or
distance in general, is not fundamental to a concept of similarity. The
only aspect of similarity really considered essential is that it is a
real valued function of a point and a set which allows the ordering of
points according to their similarity to the set. The concept of distance
is introduced here as a mathematical convenience based on intuitive notions
of similarity. It will be apparent later how this forms part of the
assumptions stated in the Introduction as underlying the theory to be
presented. Even with the introduction of the concept of distance there
are other ways of defining similarity. Nearness to the closest member
of the set is one such possibility. This implies that an event is similar
to a class of events if it is close in some sense to any member of the
class. It is not the purpose of this chapter to philosophize about the
relative merits of these different ways of defining similarity. Their
advantages and disadvantages will become apparent as this theory is
developed, and the reader will be able to judge for himself which set of
assumptions is most applicable under a given set of circumstances.

To summarize the foregoing remarks, for the purposes of the
special theory, similarity $S(P, \{F_m\})$ of a point P and a set of points
$\{F_m\}$ exemplifying a class will be defined as the average distance
between the point P and the M members of the set $\{F_m\}$. This definition
is expressed by Equation 2.1, where the metric d( ), the method of
measuring distance between two points, is left unspecified.

$$S(P, \{F_m\}) = \frac{1}{M} \sum_{m=1}^{M} d(P, F_m) \qquad (2.1)$$
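The definition of Equation 2.1 translates directly into code. The sketch below is a minimal illustration, assuming the ordinary Euclidean distance as a stand-in for the unspecified metric d( ); the names `euclidean`, `similarity`, and `members` are hypothetical. Note that under this definition a smaller value of S means the point is more similar to the set.

```python
import math

def euclidean(a, b):
    """Ordinary Euclidean distance, used here only as a placeholder metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(p, members, d=euclidean):
    """Average distance between the point p and the M members of the set.

    The metric d is a parameter because the definition leaves it unspecified.
    """
    return sum(d(p, f) for f in members) / len(members)

# Hypothetical set of M = 4 known members of a category.
members = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]

print(similarity((1.0, 1.0), members))  # point at the center of the set
print(similarity((5.0, 5.0), members))  # point well outside the set
```

Any other function satisfying the metric conditions can be passed as `d` without changing `similarity` itself, which is the sense in which the method of measuring distance does not influence the definition.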


To deserve the name metric, the function d( ) must satisfy the usual
conditions stated in Equation 2.2 a, b, c and d.

$$d(A,B) = d(B,A) \quad \text{(symmetric function)} \qquad (2.2a)$$

$$d(A,C) \le d(A,B) + d(B,C) \quad \text{(triangle inequality)} \qquad (2.2b)$$

$$d(A,B) \ge 0 \quad \text{(non-negative)} \qquad (2.2c)$$

$$d(A,B) = 0 \ \text{if, and only if,} \ A = B \qquad (2.2d)$$

2.2 Optimization and Feature Weighting

In the definition of similarity of the preceding section the
average distance between a point and a set of points served to measure
similarity of a point to a set. The method of measuring distance, however,
was left unspecified and was understood to refer to distance in perhaps
some abstract property of the set. In this section the criteria for
finding the "best" choice of the metric are discussed, and this
optimization is applied to a specific and simple class of metrics which
has interesting and useful properties.

Useful notions of "best" in mathematics are often associated with
finding the extrema of the functional to be optimized. We may seek to
minimize the average cost of our decisions or we may maximize the
probability of estimating correctly the value of a random variable. In the
problem above, a useful metric, optimal in one sense, is one which
minimizes the average distance between members of the same set subject to
certain suitable constraints devised to assure a nontrivial solution. If
the metric is thought of as extracting that property of the set in which
like events are clustered, then the average distance between members of
the set is a measure of the size of the cluster so formed. Minimization of
the average distance is then a choice of a metric which minimizes the size
of the cluster and therefore extracts that property of the set in which
they are most alike. It is only proper that a distance measure shall
minimize the average distance between those events which are selected to
exemplify events that are "close".

Although this preceding criterion for finding the best solution
is a very reasonable and meaningful assumption on which to base the special
theory, it is by no means the only possibility. Minimization of the maximum
distance between members of a set is just one of the possible alternatives
that immediately suggests itself. It should be pointed out that ultimately
the best solution is that which results in the largest number of correct
classifications of events. Making the largest number of correct decisions
on the known events is thus to be maximized and is itself a suitable
criterion of optimization which will be dealt with elsewhere in this
report. Since the primary purpose of this chapter is to outline a point of
view regarding pattern recognition through a special example, the choice
of "best" previously described and stated in Equation 2.3 will be used, for
it leads to very useful solutions with relative simplicity of the
mathematics involved. In Equation 2.3, $F_p$ and $F_m$ are the pth and mth
members of the set $\{F_m\}$.

$$\overline{D} = \frac{1}{M(M-1)} \sum_{p=1}^{M} \sum_{m=1}^{M} d(F_p, F_m) = \text{minimum over all choices of } d(\ ) \qquad (2.3)$$

Of the many different mathematical forms which a metric may
take, in the special theory here described only metrics of the form given
by Equation 2.4 will be considered. The intuitive notions underlying
the choice of the metric in this form are based on ideas of "feature
weighting" which will be developed below.


In the familiar Euclidean N-dimensional space the distance between
the two points A and B is defined by Equation 2.5. If A and B are
expressed in terms of an orthonormal coordinate system $\{\theta_n\}$, then
d(A,B) of Equation 2.5 can be written as in Equation 2.6, where $a_n$ and
$b_n$, respectively, are the coordinates of A and B in the direction of
$\theta_n$.

$$d(A,B) = |A - B| \qquad (2.5)$$

$$d(A,B) = \left[ \sum_{n=1}^{N} (a_n - b_n)^2 \right]^{1/2} \qquad (2.6)$$


We must realize, of course, that the features of the events
represented by the different coordinate directions $\theta_n$ are not all
equally important in influencing the definition of the category to which
like events belong. Therefore it is reasonable that in comparing two
points feature by feature (as is expressed in Equation 2.6), features with
decreasing significance should be weighted with decreasing weights. The
idea of feature weighting is expressed by a metric somewhat more general
than the conventional Euclidean metric. The modification is given in
Equation 2.7, where $w_n$ is the feature weighting coefficient.

$$d(A,B) = \left[ \sum_{n=1}^{N} w_n^2 (a_n - b_n)^2 \right]^{1/2} \qquad (2.7)$$

It is readily verified that the above metric satisfies the conditions
stated in Equation 2.2 if none of the $w_n$'s is zero; if any of the $w_n$
coefficients is zero, Equation 2.2d is not satisfied.
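The failure of condition 2.2d for a zero weight is easy to exhibit. The sketch below assumes the feature weighting metric has the form of a Euclidean distance with each coordinate difference multiplied by its weight before squaring, which is how the discussion of scale-factor transformations later in this section describes it; the function name `weighted_distance` and the sample points are hypothetical.

```python
import math

def weighted_distance(a, b, w):
    """Feature-weighted distance: each coordinate difference is scaled by its
    weight before the usual Euclidean sum of squares is formed (assumed form).
    """
    return math.sqrt(sum((wn * (an - bn)) ** 2 for an, bn, wn in zip(a, b, w)))

# Two distinct points that differ only in their third coordinate.
A = (1.0, 2.0, 3.0)
B = (1.0, 2.0, 7.0)

# With all weights nonzero the distance between distinct points is positive.
print(weighted_distance(A, B, (0.5, 0.3, 0.2)))

# With a zero weight on the third coordinate the distance vanishes even
# though A and B are different points, so condition 2.2d no longer holds.
print(weighted_distance(A, B, (0.5, 0.5, 0.0)))
```

A zero weight makes the measure blind to the corresponding feature, which is why it ceases to be a metric in the strict sense while remaining usable as a measure of similarity.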

It is important to note that the above metric gives a numerical
measure of "closeness" between two points, A and B, which is strongly
influenced by the particular set of similar events $\{F_m\}$. This is a
logical result, for a measure of similarity between A and B should depend
on how our notions of similarity were shaped by the set of events known to
be similar. When we deal with a different set of events which have
different similar features, our judgment of similarity between A and B
will also be based on finding agreement between them along a changed set
of their features.

An alternate and instructive way of explaining the significance
of the class of metrics given in Equation 2.4 is to recall the analogy
made in the Introduction regarding transformations of the signal space.
There, the problem of expressing what was similar among a set of events
of the same category was accomplished by finding that transformation of the
signal space (again, subject to suitable constraints) which will cluster
most highly the transformed events in the new space. If we restrict
ourselves to those linear transformations of the signal space which involve
only scale factor changes of the coordinates, and if we measure distance
in the new space by the Euclidean metric, then the Euclidean distance
between two points after their linear transformation is equivalent to the
feature weighting metric of Equation 2.4. This equivalence is shown below,
where A' and B' are vectors obtained from A and B by a linear
transformation. The most general linear transformation is expressed by
Equation 2.9, where $a'_n$ is the nth coordinate of the transformed vector
A and $b'_n$ is that of the vector B.

$$A = \sum_{n=1}^{N} a_n \theta_n \quad \text{and} \quad B = \sum_{n=1}^{N} b_n \theta_n \qquad (2.8a)$$

$$[A'] = [A][W], \qquad [B'] = [B][W] \qquad (2.8b)$$

$$[A' - B'] = [A - B][W] \qquad (2.8c)$$

$$[a'_1 \ a'_2 \ \cdots \ a'_N] = [a_1 \ a_2 \ \cdots \ a_N]
\begin{bmatrix}
w_{11} & w_{12} & \cdots & w_{1N} \\
w_{21} & w_{22} & \cdots & w_{2N} \\
\vdots &        &        & \vdots \\
w_{N1} & w_{N2} & \cdots & w_{NN}
\end{bmatrix} \qquad (2.9)$$


The Euclidean distance between A' and B', $d_E(A', B')$, is given in
Equation 2.10.

$$d_E(A', B') = \left[ \sum_{n=1}^{N} (a'_n - b'_n)^2 \right]^{1/2} = \left[ \sum_{n=1}^{N} \left( \sum_{s=1}^{N} w_{sn} (a_s - b_s) \right)^2 \right]^{1/2} \qquad (2.10)$$

If the linear transformation involves only scale factor changes of the
coordinates, only elements on the main diagonal of the W matrix are
non-zero, thus reducing $d_E(A', B')$, in this special case, to the form
given in Equation 2.11.

$$\text{Special } d_E(A', B') = \left[ \sum_{n=1}^{N} w_{nn}^2 (a_n - b_n)^2 \right]^{1/2} \qquad (2.11)$$

The above class of metrics will be used in Equation 2.3 to minimize the
average distance between the set of points. Because of the mathematical
difficulty of minimizing the sum of square roots of quantities, we will
minimize instead the mean-square distance when members of $\{F_m\}$ are
compared with each other.
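The equivalence between a scale-factor transformation followed by Euclidean distance and a feature-weighted distance in the original space can be confirmed numerically. This is a sketch under stated assumptions: the weights are taken to be the diagonal elements of the transformation, and the names `scale`, `euclidean`, `weighted_distance`, and the sample vectors are hypothetical.

```python
import math

def scale(v, w):
    """Apply a diagonal transformation: multiply coordinate n by w[n]."""
    return [wn * vn for wn, vn in zip(w, v)]

def euclidean(a, b):
    """Ordinary Euclidean distance between two points."""
    return math.sqrt(sum((an - bn) ** 2 for an, bn in zip(a, b)))

def weighted_distance(a, b, w):
    """Feature-weighted distance measured in the original space."""
    return math.sqrt(sum((wn * (an - bn)) ** 2 for an, bn, wn in zip(a, b, w)))

A = [1.0, 4.0, -2.0]
B = [3.0, 1.0, 0.5]
w = [0.6, 0.3, 0.1]  # hypothetical diagonal elements of the transformation

# Distance measured after stretching or compressing each coordinate axis.
after_transform = euclidean(scale(A, w), scale(B, w))

# Distance measured in the original space with the weighted metric.
in_original_space = weighted_distance(A, B, w)

print(after_transform, in_original_space)
```

The two numbers agree to within rounding, which is the sense in which choosing feature weights and choosing a scale change of the coordinate axes are the same problem.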

The mathematical formulation of the above minimization is given
in Equations 2.12a and 2.12b, where $f_{mn}$ denotes the coordinate of
$F_m$ in the direction of $\theta_n$. The significance of the constraint
2.12b is, for the case considered, that every weight $w_{nn}$ is a number
between 0 and 1 (the $w_{nn}$'s turn out to be positive) and it can be
interpreted as the fractional value of the features $\theta_n$ which they
weight. $w_{nn}$ denotes the fractional value which is assigned in the
total measure of distance to the degree of agreement that exists between
the components of the compared vectors.

$$\overline{D^2} = \frac{1}{M(M-1)} \sum_{p=1}^{M} \sum_{m=1}^{M} \sum_{n=1}^{N} w_{nn}^2 (f_{mn} - f_{pn})^2 = \text{minimum} \qquad (2.12a)$$

$$\sum_{n=1}^{N} w_{nn} = 1 \qquad (2.12b)$$

Although the constraint of 2.12b is appealing from a feature
weighting point of view, from a strictly mathematical standpoint it leaves
much to be desired. It does not guarantee, for instance, that a simple
shrinkage in the size of the signal space is disallowed. Such a shrinkage
would not change the relative orientation of the points to each other,
the property really requiring alteration. The constraint given in
Equation 2.13, on the other hand, states that the volume of the space is
constant, as if the space were filled with an incompressible fluid. Here
one merely wishes to determine what kind of a rectangular box could contain
the space so as to minimize the mean-square distance among a set of points
imbedded in the space.

$$\prod_{n=1}^{N} w_{nn} = 1 \qquad (2.13)$$


The minimization problem with both of these constraints will
be worked out in the following equations, and it will be seen that the
results are quite similar.

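Interchanging the order of summations and expanding the squared
expression in Equation 2.12a yields Equation 2.14, where it is recognized
that the factor multiplying $w_{nn}^2$ is the variance
$\sigma_n^2 = \overline{f_n^2} - \bar{f}_n^2$ of the coefficients of the
$\theta_n$ coordinate. Minimization of Equation 2.14 under the constraint
2.12b yields Equation 2.15, where $\rho$ is an arbitrary constant. Imposing
constraint 2.12b again, we can solve for $w_{nn}$, obtaining Equation 2.16.

$$\overline{D^2} = \frac{2M}{M-1} \sum_{n=1}^{N} w_{nn}^2 \left( \overline{f_n^2} - \bar{f}_n^2 \right) = \frac{2M}{M-1} \sum_{n=1}^{N} w_{nn}^2 \sigma_n^2 \qquad (2.14)$$

$$w_{nn} = \frac{\rho}{\sigma_n^2}, \qquad n = 1, 2, \ldots, N \qquad (2.15)$$

$$w_{nn} = \frac{1/\sigma_n^2}{\sum_{p=1}^{N} 1/\sigma_p^2} \qquad (2.16)$$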


That the values of $w_{nn}$ so found are indeed those which minimize
$\overline{D^2}$ of Equation 2.12a can be seen by noting that
$\overline{D^2}$ is an elliptic paraboloid in an N-dimensional space and
the constraint of 2.12b is a plane of the same dimensions. For a
three-dimensional case, this is illustrated in Figure 4. The intersection
of the elliptic paraboloid with the plane is a curve whose only point of
zero derivative is a minimum.
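The minimization under the constraint that the weights sum to unity can be illustrated numerically. The sketch below assumes the result stated in the text, namely weights proportional to the reciprocal of the coordinate variances and normalized to sum to one; the function names, the sample set `members`, and the omission of the constant factor in the objective are assumptions made for illustration. It requires every coordinate to have nonzero variance.

```python
import statistics

def inverse_variance_weights(members):
    """Weights proportional to 1/variance of each coordinate, summing to 1."""
    n_dims = len(members[0])
    variances = [
        statistics.pvariance([f[n] for f in members]) for n in range(n_dims)
    ]
    reciprocals = [1.0 / v for v in variances]  # assumes no zero variance
    total = sum(reciprocals)
    return [r / total for r in reciprocals], variances

def objective(weights, variances):
    """Sum of squared weights times variances; this is proportional to the
    mean-square distance among members, the constant factor being dropped."""
    return sum(w * w * v for w, v in zip(weights, variances))

# Hypothetical set: the first coordinate is tightly clustered, the second
# varies widely over the ensemble.
members = [(1.0, 10.0), (2.0, 30.0), (3.0, 20.0), (2.0, 40.0)]

weights, variances = inverse_variance_weights(members)
print("variances:", variances)
print("weights:  ", weights)
print("objective:", objective(weights, variances))
```

Shifting a little weight from one coordinate to the other while keeping the sum at unity increases the objective, in keeping with the paraboloid-and-plane argument above.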

The physical interpretation of weighting features by the
reciprocal of their variances is given below.

If the variance of a coordinate of the ensemble is large, then
the corresponding $w_{nn}$ is small, indicating that small weight is to be
given in the overall measure of distance to a feature of large variation.
If the variance of the magnitude of a given coordinate $\theta_n$ is small,
on the other hand, then its value can be accurately anticipated; therefore
$\theta_n$ should be counted heavily in a measure of similarity. It is
important to note that in the extreme case, where the variance of the
magnitude of a component of the set is zero, the corresponding $w_{nn}$ in
Equation 2.16 is equal to unity with all other $w_{nn}$ equal to zero. In
this case, although Equation 2.11 is not a legitimate metric since it does
not satisfy Equation 2.2, it is still a meaningful measure of similarity.
If any coordinate occurs with identical magnitudes in all members of the
set, then it is an "all important" feature of the set and nothing else
needs to be considered in judging the events similar. Judging membership
in a category by such an "all important" feature may, of course, result in
the incorrect inclusion of nonmembers into the category. For instance,
"red, nearly circular figures" have the color red as a common attribute.
The transformation described thus far would pick out "red" as an all
important feature and would judge membership in the category of "red,
nearly circular figures" only by the color of the compared object. A red
square, for instance, would thus be misclassified and judged to be a "red,
nearly circular figure". Given only examples of the category, on the other
hand, such results would probably be expected. Later on, however, where
labeled examples of all categories of interest are assumed given, only
those attributes are emphasized in which members of a category are alike
and in which they differ from those of other categories.

It should be noted that the weighting coefficients do not
necessarily decrease monotonically in the above feature weighting which
minimizes the mean-square distance among M given examples of the class.
Furthermore, the results of Equation 2.16 or 2.18 are independent of the
particular orthonormal system of coordinates. Equations 2.16 and 2.18
simply state that the weighting coefficient is inversely proportional to
the variance or to the standard deviation of the ensemble along the
corresponding coordinate. The numerical values of the variances, on the
other hand, do depend on the coordinate system.

If we use the mathematically more appealing constraint of
Equation 2.13 in place of that in 2.12b, we obtain Equation 2.17.


(2.17b) 


It is readily seen that by applying Equation 2.17a, the expression
of 2.17b is equivalent to Equation 2.18a, where the bracketed expression
must be zero for all values of n. This substitution leads to Equation 2.18b
which may be reduced to Equation 2.18c by application of Equation 2.17a once
more.


(2.18a)

(2.18b)

(2.18c)


Thus it is seen that the feature weighting coefficient is
proportional to the reciprocal standard deviation of the coordinates,
thereby lending itself to the same type of interpretation as before.
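The constant-volume case can be sketched in the same way. The closed form used below is an assumption consistent with the statement above, not a transcription of the report's equations: each weight is taken as the geometric mean of the standard deviations divided by the standard deviation of its own coordinate, so that the product of the weights is unity. The function names and the sample set are hypothetical, and every coordinate is assumed to have nonzero spread.

```python
import math
import statistics

def reciprocal_std_weights(members):
    """Weights proportional to 1/standard deviation, scaled so that their
    product equals 1 (assumed closed form for the constant-volume case)."""
    n_dims = len(members[0])
    sigmas = [statistics.pstdev([f[n] for f in members]) for n in range(n_dims)]
    geometric_mean = math.exp(sum(math.log(s) for s in sigmas) / n_dims)
    return [geometric_mean / s for s in sigmas], sigmas

def objective(weights, sigmas):
    """Sum of squared weights times variances, proportional to the
    mean-square distance among members of the set."""
    return sum((w * s) ** 2 for w, s in zip(weights, sigmas))

# Hypothetical set with one tightly clustered and one widely spread coordinate.
members = [(1.0, 10.0), (2.0, 30.0), (3.0, 20.0), (2.0, 40.0)]

weights, sigmas = reciprocal_std_weights(members)
print("weights:", weights)
print("product of weights:", math.prod(weights))
print("objective:", objective(weights, sigmas))
```

Stretching one axis by some factor and compressing the other by the same factor keeps the volume fixed but increases the objective, which is the behavior a constrained minimum should show.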


2.3 Describing the Category

The set of known members is the best description of the category.
Following the practice of probability theory, this set of similar events
can be described by its statistics; the ensemble mean, variance, and
higher moments can be specified as its characteristic properties. For
our purposes a more suitable description of our idea of the category, on
the other hand, is found in the specific form of the function S of
Equation 2.1 developed from the set of similar events to measure membership
in the category. A marked disadvantage of S is that (in a machine which
implements its application) the amount of storage capacity which must be
available is proportional to the number of events introduced and is thus
a growing quantity. For this reason a description of the set of points
is desired in the form of a point E which may be considered most typical
of the ensemble of points belonging to the set. Describing the category
by means of a single point is analogous to designating a particular
capital A as characterizing the set of different capital A's that are
encountered. This single A takes the place of the entire ensemble of A's
and represents it by being the typifying example of the set. The most
important attribute of the typifying example, from the point of view of
correctly representing the set, is that the "distance" measured between
an arbitrary point P and E should agree with the mean-square distance
measured by the function S between P and members of the set. The distance
in both cases is measured with the metric developed in the preceding
section. The equality of these distances is stated in Equation 2.19, where
$e_n$ is the coordinate of E in the $\theta_n$ direction and $p_n$ is the
coordinate of P in the same direction.


(2.19)


Interchanging the order of summations, expanding the squares, and
collecting like terms yields Equation 2.20.


(2.20)


This equation does not have a unique solution unless further constraints
are imposed. A convenient set of constraints is the requirement that
the above equality hold for any choice of the metric. This can be
shown to mean that the equation must hold for each n. Under this constraint
the unique solution for E is given by Equations 2.21a and 2.21b,

$$E = \sum_{n=1}^{N} e_n \theta_n \qquad (2.21a)$$

where


(2.21b)


The interesting aspect of this result is that the choice of the
typifying vector E depends on P, the vector to be compared to the set.
This fact does not render E any less significant. Instead of comparing
P with every member of the set, as in the definition of S, it is equivalent
to compare it with E, given in Equation 2.21. The set of known members
of the category appears in E as the constants \bar{f}_n and \overline{f_n^2}, which may be
computed once and for all. This fact has important implications regarding
the amount of information which must be stored. In the comparison of an
arbitrary point P with the set by means of the function S(P, {F_m}),
all M members of the set must be stored, each having N coordinates. The
total stored information about the set is thus MN numbers. In the
comparison of P with E, on the other hand, the total storage is only 2N
numbers.
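
The storage argument above can be checked numerically. The following is a minimal sketch, not the author's instrumentation: it assumes a diagonal metric with weights w_n, stores only the 2N averages (the mean and the mean square of each coordinate), and takes the positive root of the per-coordinate quadratic for e_n. The function names and the sample data are hypothetical.

```python
import math

def similarity_full(p, members, w):
    """Mean-square weighted distance between p and every stored member (MN numbers)."""
    M = len(members)
    return sum(
        (w[n] * (p[n] - f[n])) ** 2 for f in members for n in range(len(p))
    ) / M

def class_statistics(members):
    """Return the 2N stored numbers: per-coordinate mean and mean square."""
    M, N = len(members), len(members[0])
    mean = [sum(f[n] for f in members) / M for n in range(N)]
    mean_sq = [sum(f[n] ** 2 for f in members) / M for n in range(N)]
    return mean, mean_sq

def typifying_point(p, mean, mean_sq):
    """Coordinates e_n of E for a given p (positive root chosen, an assumption)."""
    return [
        p[n] + math.sqrt(max(p[n] ** 2 - 2 * p[n] * mean[n] + mean_sq[n], 0.0))
        for n in range(len(p))
    ]

def similarity_via_E(p, mean, mean_sq, w):
    """Weighted squared distance between p and the typifying point E."""
    e = typifying_point(p, mean, mean_sq)
    return sum((w[n] * (p[n] - e[n])) ** 2 for n in range(len(p)))

# Hypothetical data: three members in two dimensions.
members = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
w = [1.5, 0.5]
p = [0.0, 4.0]
mean, mean_sq = class_statistics(members)
s_full = similarity_full(p, members, w)
s_e = similarity_via_E(p, mean, mean_sq, w)
```

Both routes give the same number; only `mean` and `mean_sq` need to be kept once the members have been seen.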

2.4 Choosing the Optimum Orthogonal Coordinate System

The labeled events which belong to one category have been
assumed given as vectors in an a priori selected coordinate system which
expressed features of the events thought relevant to the determination of
the category. An optimum set of feature weighting coefficients was then
found through which similar events could be judged most similar to one
another. It would be purely coincidental, however, if the features
represented by the given coordinate system were optimal in expressing
the similarities among members of the set. In this section, therefore, we
look for a new set of coordinates, spanning the same space, and expressing
a different set of features which minimize the mean-square distance
between members of the set. The problem just stated can be thought of as
either enlarging the class of metrics considered thus far in the measure
of similarity defined earlier or as enlarging the class of transformations
of the space within which class we look for that particular transformation
which minimizes the mean-square distance between similar events.

It was proved earlier that the linear transformation which changes
the scale of the n-th dimension of the space by the factor w_{nn} while
keeping the volume of the space constant and minimizing the mean-square
distance between the transformed vectors is given by Equation 2.22,


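    [F'] = [F] [W],

where

    [W] = diag( w_{11}, w_{22}, ..., w_{NN} )        (2.22a)

and

    w_{nn} = \frac{ ( \prod_{p=1}^{N} \sigma_p )^{1/N} }{ \sigma_n }.        (2.22b)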


The mean-square distance under this transformation is given by Equation
2.23 and is a minimum for the given choice of orthogonal coordinate system.

    \overline{D^2} = \frac{1}{M(M-1)} \sum_{p=1}^{M} \sum_{m=1}^{M} \sum_{n=1}^{N} w_{nn}^2 (f_{mn} - f_{pn})^2 = minimum.        (2.23)
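
As an illustration of the weighting just described, the sketch below computes weights inversely proportional to the standard deviation of each coordinate, scaled so that their product is one (the constant-volume condition), and compares the resulting mean-square distance between members with the unweighted one. It assumes the normalization 1/(M(M-1)) over ordered pairs of members; the function names and data are hypothetical.

```python
import math

def std_devs(members):
    """Standard deviation of each coordinate over the M members."""
    M, N = len(members), len(members[0])
    out = []
    for n in range(N):
        mean = sum(f[n] for f in members) / M
        out.append(math.sqrt(sum((f[n] - mean) ** 2 for f in members) / M))
    return out

def optimum_weights(sigmas):
    """w_n = (geometric mean of the sigmas) / sigma_n, so the product of the w_n is 1."""
    geo = math.prod(sigmas) ** (1.0 / len(sigmas))
    return [geo / s for s in sigmas]

def mean_square_distance(members, w):
    """Weighted mean-square distance between all ordered pairs of members."""
    M, N = len(members), len(members[0])
    total = sum(
        (w[n] * (fm[n] - fp[n])) ** 2
        for fm in members for fp in members for n in range(N)
    )
    return total / (M * (M - 1))

# Hypothetical set: widely spread along the first axis, tight along the second.
members = [[0.0, 0.0], [4.0, 1.0], [8.0, 0.0], [4.0, -1.0]]
w = optimum_weights(std_devs(members))
d_opt = mean_square_distance(members, w)
d_unit = mean_square_distance(members, [1.0, 1.0])
```

For this set the weights come out as 0.5 and 2.0: the widely scattered coordinate is de-emphasized and the tightly clustered one is stressed, which reduces the mean-square distance.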
It is possible, however, to rotate the coordinate system until one is
found which minimizes the above minimum mean-square distance. Whereas
the first minimization took place with respect to all choices of the
w_{nn}, we are now interested in further minimizing this by first rotating
the coordinate system so that the above optimum choice of w_{nn} should
result in the absolute minimum distance between vectors. The solution
of the above search for the optimum transformation may be conveniently
stated in the form of the following theorem.

Theorem

The linear transformation which, after transformation, minimizes
the mean-square distance between a set of vectors, subject to the
constraint that the volume of the space is invariant under transformation,
is a rotation [C] followed by a diagonal transformation [W]. The rows
of the matrix [C] are eigenvectors of the covariance matrix [U] of the
set of vectors, and the elements of [W] are those given in Equation 2.22b,
where \sigma_p is the standard deviation of the coefficients of the set of
vectors in the direction of the p-th eigenvector of [U].

The proof of the above theorem is readily obtained as follows.

Proof

Expanding the square of Equation 2.23 and substituting the
values of w_{nn} results in Equation 2.24, which is to be minimized over all
choices of the coordinate system.
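    \overline{D^2} = \frac{2M}{M-1} \sum_{n=1}^{N} w_{nn}^2 ( \overline{f_n^2} - \bar{f}_n^2 )        (2.24a)

                   = \frac{2M}{M-1} \sum_{n=1}^{N} w_{nn}^2 \sigma_n^2        (2.24b)

                   = \frac{2MN}{M-1} [ \prod_{p=1}^{N} \sigma_p^2 ]^{1/N}        (2.24c)

Let the given coordinate system be transformed by the matrix [C],

    [C] = \begin{bmatrix}
            c_{11} & c_{12} & \cdots & c_{1N} \\
            c_{21} & c_{22} & \cdots & c_{2N} \\
            \vdots &        &        & \vdots \\
            c_{N1} & c_{N2} & \cdots & c_{NN}
          \end{bmatrix},
    where \sum_{n=1}^{N} c_{pn}^2 = 1 for p = 1, 2, ..., N.        (2.25)

Equation 2.24 is minimized if the bracketed expression in Equation 2.24c
is minimized. The latter may be named \beta and written as below.

    \beta = \prod_{p=1}^{N} \sigma_p^2
          = \prod_{p=1}^{N} [ \frac{1}{M} \sum_{m=1}^{M} f'^2_{mp} - ( \frac{1}{M} \sum_{m=1}^{M} f'_{mp} )^2 ],        (2.26a)

where

    f'_{mp} = \sum_{n=1}^{N} f_{mn} c_{pn}.        (2.26b)

Substituting Equation 2.26b into 2.26a, we obtain Equation 2.27, where
the averaging is understood to be over the set of M vectors.

    \beta = \prod_{p=1}^{N} [ \sum_{n=1}^{N} \sum_{s=1}^{N} ( \frac{1}{M} \sum_{m=1}^{M} f_{mn} f_{ms} ) c_{pn} c_{ps}
                              - ( \sum_{n=1}^{N} \bar{f}_n c_{pn} )^2 ]        (2.27)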
The squared expression may be written as a double sum and the entire equation
simplified to 2.28.

    \beta = \prod_{p=1}^{N} \sum_{n=1}^{N} \sum_{s=1}^{N} ( \overline{f_n f_s} - \bar{f}_n \bar{f}_s ) c_{pn} c_{ps}        (2.28)

But \overline{f_n f_s} - \bar{f}_n \bar{f}_s = u_{ns} = u_{sn} is an element of the covariance matrix [U].
Hence

    \beta = \prod_{p=1}^{N} \sum_{n=1}^{N} \sum_{s=1}^{N} u_{ns} c_{pn} c_{ps}.        (2.29)

Using the method of Lagrange multipliers to minimize \beta in
Equation 2.29, subject to the constraint of Equation 2.25, we obtain
Equation 2.30 below as the total differential of \beta. The differential
of the constraint is given in Equation 2.31.

    d\beta = \sum_{l=1}^{N} \sum_{q=1}^{N} \frac{\partial \beta}{\partial c_{lq}} dc_{lq} = 0        (2.30)

    2 \sum_{q=1}^{N} c_{lq} dc_{lq} = 0,   l = 1, 2, ..., N        (2.31)

By way of explanation of Equation 2.30, it is seen that when Equation
2.29 is differentiated with respect to c_{lq}, all the factors in the
product in Equation 2.29 for which p \ne l are simply constants. Carrying out
the differentiation stated in Equation 2.30, we obtain
    d\beta = \sum_{l=1}^{N} \sum_{q=1}^{N} [ 2 A_l \sum_{s=1}^{N} u_{qs} c_{ls} ] dc_{lq} = 0,        (2.32)

where

    A_l = \prod_{p \ne l} \sum_{n=1}^{N} \sum_{s=1}^{N} u_{ns} c_{pn} c_{ps}.        (2.33)

Note that, since p \ne l, A_l is just a constant as regards optimization of
any c_{lq}.

In accordance with the method of Lagrange multipliers, each of the
N constraints of Equation 2.31 is multiplied by a different arbitrary constant
\mu_l and is added to d\beta as shown below.

    \sum_{l=1}^{N} \sum_{q=1}^{N} [ 2 A_l \sum_{s=1}^{N} u_{qs} c_{ls} + 2 \mu_l c_{lq} ] dc_{lq} = 0        (2.34)

By letting \lambda_l = -\mu_l / A_l and recognizing that dc_{lq} is arbitrary, we
get

    \sum_{s=1}^{N} c_{ls} u_{sq} - \lambda_l c_{lq} = 0,   q = 1, 2, ..., N;  l = 1, 2, ..., N.        (2.35)

Let the l-th row of the [C] matrix be the vector C_l. Then the above
equation may be written as the eigenvalue problem of Equation 2.36 by
recalling that u_{ns} = u_{sn}.

    C_l [ U - \lambda_l I ] = 0,  for l = 1, 2, ..., N.        (2.36)
Solutions of Equation 2.36 exist only for N specific values of \lambda_l.
The vector C_l is an eigenvector of the covariance matrix [U]. The
eigenvalues \lambda_l are positive and the corresponding eigenvectors are
orthogonal since the matrix [U] is symmetric and positive definite. Since
the transformation [C] is to be non-singular, the different rows C_l must
correspond to different eigenvalues of [U]. It may be shown that the only
extremum of \beta is a minimum, subject to the constraint of Equation 2.25.
Thus the optimum linear transformation which minimizes the mean-square
distance of a set of vectors while keeping the volume of the space constant
is given by Equation 2.37, where rows of [C] are eigenvectors of the
covariance matrix [U].


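    \begin{bmatrix}
      f'_{11} & f'_{12} & \cdots & f'_{1N} \\
      f'_{21} & f'_{22} & \cdots & f'_{2N} \\
      \vdots  &         &        & \vdots  \\
      f'_{M1} & f'_{M2} & \cdots & f'_{MN}
    \end{bmatrix}
    =
    \begin{bmatrix}
      f_{11} & f_{12} & \cdots & f_{1N} \\
      f_{21} & f_{22} & \cdots & f_{2N} \\
      \vdots &        &        & \vdots \\
      f_{M1} & f_{M2} & \cdots & f_{MN}
    \end{bmatrix}
    \begin{bmatrix}
      w_{11} c_{11} & w_{22} c_{21} & \cdots & w_{NN} c_{N1} \\
      w_{11} c_{12} & w_{22} c_{22} & \cdots & w_{NN} c_{N2} \\
      \vdots        &               &        & \vdots        \\
      w_{11} c_{1N} & w_{22} c_{2N} & \cdots & w_{NN} c_{NN}
    \end{bmatrix}        (2.37)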
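
The effect of rotating before rescaling can be illustrated in two dimensions, where the eigenvectors of a symmetric 2 x 2 covariance matrix follow from a single rotation angle. The sketch below is an illustration under stated assumptions, not the author's procedure: it is restricted to N = 2, assumes the 1/(M(M-1)) normalization of the mean-square distance, and uses hypothetical function names and data.

```python
import math

def covariance_2d(points):
    """Return (u_xx, u_xy, u_yy) of a list of 2-D points."""
    M = len(points)
    mx = sum(f[0] for f in points) / M
    my = sum(f[1] for f in points) / M
    uxx = sum((f[0] - mx) ** 2 for f in points) / M
    uyy = sum((f[1] - my) ** 2 for f in points) / M
    uxy = sum((f[0] - mx) * (f[1] - my) for f in points) / M
    return uxx, uxy, uyy

def eigen_rotation_2d(uxx, uxy, uyy):
    """Rotation matrix whose rows are unit eigenvectors of the covariance matrix."""
    theta = 0.5 * math.atan2(2.0 * uxy, uxx - uyy)
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def rotate(points, C):
    """New coordinate p of each point is its projection on row p of C."""
    return [[sum(f[n] * C[p][n] for n in range(2)) for p in range(2)] for f in points]

def optimum_rescale(points):
    """Scale each axis by (geometric mean of sigmas) / sigma of that axis."""
    uxx, _, uyy = covariance_2d(points)
    sig = [math.sqrt(uxx), math.sqrt(uyy)]
    geo = math.sqrt(sig[0] * sig[1])
    return [[geo / sig[n] * f[n] for n in range(2)] for f in points]

def mean_square_distance(points):
    """Mean-square Euclidean distance over all ordered pairs of points."""
    M = len(points)
    total = sum((a[n] - b[n]) ** 2 for a in points for b in points for n in range(2))
    return total / (M * (M - 1))

# Hypothetical, strongly correlated set of five events.
points = [[0.0, 0.0], [2.0, 1.0], [4.0, 5.0], [6.0, 4.0], [8.0, 8.0]]
uxx, uxy, uyy = covariance_2d(points)
C = eigen_rotation_2d(uxx, uxy, uyy)
rotated = rotate(points, C)
d_axes = mean_square_distance(optimum_rescale(points))      # rescaling only
d_rotated = mean_square_distance(optimum_rescale(rotated))  # rotation, then rescaling
```

Because the rotation removes the correlation between coordinates, the product of the variances drops from u_xx u_yy to the determinant of [U], and the rescaled mean-square distance drops accordingly.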


The numerical value of the minimum mean-square distance may now
be computed as follows. The quantity \overline{D^2} was given in Equation 2.24c, which
is reproduced here as Equation 2.38.

    \overline{D^2} = \frac{2MN}{M-1} [ \prod_{p=1}^{N} \sigma_p^2 ]^{1/N}        (2.38)
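Substituting \beta from Equation 2.29, we obtain Equation 2.39.

    \overline{D^2} = \frac{2MN}{M-1} [ \prod_{p=1}^{N} \sum_{n=1}^{N} \sum_{s=1}^{N} u_{ns} c_{pn} c_{ps} ]^{1/N}        (2.39)

But from Equation 2.35 we see that min \overline{D^2} may be written as below, where
the constraint 2.25 has also been utilized and \lambda_p is the eigenvalue of [U]
belonging to the p-th row of [C].

    min \overline{D^2} = \frac{2MN}{M-1} [ \prod_{p=1}^{N} \lambda_p \sum_{n=1}^{N} c_{pn}^2 ]^{1/N}
                       = \frac{2MN}{M-1} [ \prod_{p=1}^{N} \lambda_p ]^{1/N}        (2.40)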


It should be noted that the constraint of Equation 2.25 is not,
in general, a constant volume constraint. It is that only if the
transformation [C] is orthogonal, as is the case in the solution just
obtained. The set of transformations which keeps the volume constant is
T_V in Figure 5. A subset of these are the orthogonal transformations T_0
of constant volume, of which the optimum was desired. The solution presented
here found the optimum transformation among a set T_L, which contains
orthogonal transformations of constant volume but is not necessarily constant
volume for those which are non-orthogonal. The solution here given,
therefore, is optimum among the constant volume transformations T_V \cap T_L
shown shaded in Figure 5. This intersection is a larger set of
transformations than that for which the optimum was sought.


Figure  5.  Sets  of  Transformations 
The methods of this chapter are optimal in measuring membership
in categories of certain types. Suppose, for instance, that categories are
statistically independent random processes which generate members with
multivariate Gaussian probability distributions of unknown means and variances.
Elsewhere it is shown that the metric developed here measures contours of
equal a posteriori probabilities. Given the set of labeled events, the metric
specifies the locus of points which are members of the category in question
with equal probability.

Before bringing this chapter to a conclusion, the important
concepts introduced here will be summarized.

Categorization, the basic problem of pattern recognition, is
regarded as the process of learning how to partition the signal space into
regions where each contains points of only one category. The notion of
similarity between a point and a set of points of a category plays a dominant
role in the partitioning of signal space. Similarity of a point to a set of
points is regarded as the average "distance" between the point and the set.
The sense in which distance is understood is not specified, but the optimum
sense is thought to be that which (by the optimum method of measuring distance)
clusters most highly those points which belong to the same category. The
mean-square distance between points of a category is a measure of clustering.
An equivalent alternate interpretation of similarity (not as general as the
interpretation above) is that the transformation which optimally clusters like
points, subject to suitable criteria to assure the non-triviality of the
transformations, is instrumental in exhibiting the similarities between points
of a set. In particular, the optimum orthogonal transformation, and hence a
non-Euclidean method of measuring distance, is found which minimizes the
mean-square distance between a set of points, if the volume of the space is
held constant to assure
non-triviality. The resulting measure of similarity between a point P
and a set {F_m} is given in Equation 2.41, where a_{ns} is given by the Theorem
of this chapter.

    S(P, {F_m}) = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N} [ \sum_{s=1}^{N} a_{ns} (p_s - f_{ms}) ]^2        (2.41)


To facilitate the instrumentation of computations of the function S, a
typifying example E of the set is developed which sets an upper bound on
the necessary information storage at 2N numbers, where N is the number
of dimensions of the space in which the points are represented.
3. CATEGORIZATION

3.1 The Process of Classification

Pattern recognition consists of the twofold task of "learning",
on one hand, what the category or class is to which a set of events belongs,
and of deciding, on the other hand, whether a new event belongs to the
category or not. In this chapter details of the method of accomplishing
these two parts of the task are discussed, subject to the limitations on
recognizable categories imposed by the assumptions stated earlier.

In the following section two distinct modes of operation of the
recognition system will be distinguished. The first of these consists of
the sequential introduction of a set of events, each labeled according to
the category to which it belongs. During this period, identification of the
common pattern of the inputs which allows their classification into their
respective categories is desired. As part of the process of learning to
categorize, the estimate of what the category is must also be updated to
include each new event as it is introduced. The process of updating the
estimate of the common pattern consists of recomputing the new measures of
similarity and the typifying examples of the sets so that these will include
the new, labeled event on which the above quantities are based.

During the second mode of operation the event P to be classified
is compared to each of the sets of labeled events by the measure of
similarity found best for each set. The event is then classified as a member
of that category to which it is most similar.

It is not possible to state with certainty that the pattern has been
successfully learned or recognized from a set of its examples, because
information is not available on how examples were selected to represent the
class. Nevertheless, it is possible to obtain a qualitative indication
of how certain we may be of having obtained a correct method of determining
membership in the category from the ensemble of similar events. As each new
event is introduced, its similarity to the members of the sets already
presented is measured by the function S defined in the preceding chapter. The
magnitude of the number S indicates how close the new event is to those
already introduced. As S is refined and, with each new example, improves
its ability to recognize the class, the numerical measure of similarity
between new examples and the class will tend to decrease, on the average.
Strictly speaking, of course, this last statement cannot be true in general.
It may be true only if the categories to be distinguished are separable by
functions S taken from the class which we have considered; even under this
condition the statement is true only if certain assumptions are made
regarding the statistical distribution of the samples on which we learn. Since
we have no a priori knowledge regarding the satisfaction of either of these
two requirements, the convergence of the similarity as the sample size is
increased is simply qualitative wishful thinking whose heuristic
justification is based on the minimization problem solved in developing S.

Figure 6 illustrates the mechanization of the learning and
recognition modes of the special classificatory process discussed so far. For
the sake of clarity, the elementary block diagram of the process is shown to
distinguish only between two categories of events, but it can be extended
readily to distinguish between an arbitrary number of categories. It should
be noted that one of the categories may be the complement of all others.
The admission of such a category into the set is one of the ways in which
a machine which is always forced to classify events into known categories
may be made to decide that an event does not belong to any of the desired
ones; it belongs to the category of "everything else". Samples of
"everything else" must, of course, be given.


Figure 6. Elementary Block Diagram of the Classification Process

During the first mode of operation, the input to the machine is a
set of labeled events. Let us follow its behavior through an example.
Suppose that a number of events, some belonging to set A and some to set B,
have already been introduced. According to the method described in the
previous chapter, therefore, the optimum metrics (one for each class) have
been found which minimize the mean-square distance between events of the
same set. Similarly, the best exemplars of the sets have also been found.
As a new labeled event is introduced (say, it belongs to set A), the switch
at the input is first turned to the recognition mode R so that the new event
P may be compared to set A as well as to set B through the functions
S_A(P) = S(P, {A_m}) = S_A(P, E_A) and S_B(P), which were computed before the
introduction of P. The comparison of S_A - S_B with a threshold K indicates
whether the point P would be classified correctly or incorrectly from
knowledge available up to the present. The input switch is then turned to A
so that P, which indeed belongs to A, may be included in the computation of
the best metric and exemplar of set A.

When the next labeled event is introduced (let us say it belongs to
set B), the input switch is again turned to R to test the ability of the
machine to classify the new event correctly. After the test, the switch
is turned to B so that the event may be included among the examples of set
B and the optimum function S_B may be recomputed. This procedure is
repeated for each new event, and a record is kept of the rate at which
incorrect classifications would be made on the known events. When the
training period is completed, presumably as a result of satisfactory
performance on the selection of known events, the input switch is left in
the recognition mode.
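
The test-then-include procedure described above can be sketched as a short loop. This is a simplified illustration, not the block diagram of Figure 6: it assumes only two categories named "A" and "B", an axis-aligned metric (no rotation of coordinates), a small floor on the standard deviations to avoid division by zero, and that testing starts once each category holds two events. All names and data are hypothetical.

```python
import math

class Category:
    """Running per-coordinate sums for one category (axis-aligned metric only)."""

    def __init__(self, n_dims):
        self.count = 0
        self.sum = [0.0] * n_dims
        self.sum_sq = [0.0] * n_dims

    def include(self, event):
        """Learning mode: add a labeled event to the statistics."""
        self.count += 1
        for n, x in enumerate(event):
            self.sum[n] += x
            self.sum_sq[n] += x * x

    def similarity(self, p, floor=1e-6):
        """Recognition mode: mean-square weighted distance from p to the set."""
        mean = [s / self.count for s in self.sum]
        mean_sq = [s / self.count for s in self.sum_sq]
        sig = [max(math.sqrt(max(q - m * m, 0.0)), floor)
               for m, q in zip(mean, mean_sq)]
        geo = math.prod(sig) ** (1.0 / len(sig))
        return sum((geo / sig[n]) ** 2 * (p[n] ** 2 - 2 * p[n] * mean[n] + mean_sq[n])
                   for n in range(len(p)))

def train(labeled_events, n_dims, threshold_k=0.0):
    """Test each new event with current knowledge, then include it in its own set."""
    cats = {"A": Category(n_dims), "B": Category(n_dims)}
    errors = tested = 0
    for event, label in labeled_events:
        if all(c.count >= 2 for c in cats.values()):
            diff = cats["A"].similarity(event) - cats["B"].similarity(event)
            decided = "A" if diff < threshold_k else "B"
            tested += 1
            errors += decided != label
        cats[label].include(event)
    return cats, errors, tested

events = [([0.0, 0.0], "A"), ([10.0, 10.0], "B"), ([1.0, 0.5], "A"), ([9.0, 10.5], "B"),
          ([0.5, 1.0], "A"), ([10.5, 9.0], "B"), ([0.2, 0.3], "A"), ([9.8, 9.9], "B")]
cats, errors, tested = train(events, n_dims=2)
```

The ratio `errors / tested` plays the role of the recorded misclassification rate; training would stop when it is judged low enough.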

3.2 Learning

"Supervised learning" takes place in the interval of time in which
examples of the categories generate ensembles of points from which the
defining features of the classes are obtained by methods previously discussed.
"Supervision" is provided by an outside source such as a human who elects to
teach the recognition of pattern by examples, and who selects the examples
on which to learn.

"Unsupervised learning", by contrast, is a method of learning without
the aid of such an outside source. It is clear, at least intuitively, that
the unsupervised learning of membership in specific classes cannot succeed
unless it is preceded by a period of supervision, during which some concepts
regarding the characteristics of classes are established. A specified degree
of certainty concerning the patterns has been achieved in the form of a
sufficiently low rate of misclassification during the supervised learning
period. The achievement of the low misclassification rate, in fact, can
be used to signify the end of the learning period, after which the system
which performs the operations indicated in Figure 6 may be left to its own
devices. It is only after this supervised interval of time that the system
may be usefully employed to recognize, without outside aid, events as
belonging to one or another of the categories.

Throughout the period of learning on examples, each example is
included in its proper set of similar events which influence the changes of
the measures of similarity. After supervised activity has ceased, events
introduced for classification may belong to any of the categories; and no
outside source informs the machine of the correct category. The machine
itself, operating on each new event, however, can determine, with the already
qualitatively specified probability of error, to which class the event should
belong. If the new event is included in the set exemplifying this class,
the function measuring membership in the category has been altered.
Unsupervised learning results from the successive alterations of the metrics,
brought about by the inclusion of events into the sets of labeled events
according to determination of class membership rendered by the machine itself.
This learning process is instrumented by the dotted line in Figure 6 which,
when the learning switch L is closed, allows the machine's decisions to
control routing of the input to the various sets.

To facilitate the illustration of some implications of the process
described above, consider the case in which recognition of membership in a
single class is desired and all the labeled events are members of only that
class. In this case, classification of events as members or nonmembers of
the category degenerates into the comparison of the similarity S with a
threshold T. If S is greater than T, the event is a nonmember; if S is
less than T, on the other hand, the event is said to be a member of the class.
Since the machine decides that all points of the signal space for which S is
less than T are members of the class, the latter, as far as the machine is
concerned, is the collection of points which lie in a given region in the
signal space. For the specific function S of the previous chapter, this
region is an ellipsoid in the N-dimensional space.
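
A minimal sketch of the single-class decision follows. It assumes an axis-aligned metric (weights inversely proportional to the standard deviation of each coordinate, with unit product) so that the region S < T is an ellipse with axes along the coordinates; the function names, the sample set, and the threshold value are hypothetical.

```python
import math

def class_statistics(members):
    """Per-coordinate mean and mean square of the labeled members."""
    M, N = len(members), len(members[0])
    mean = [sum(f[n] for f in members) / M for n in range(N)]
    mean_sq = [sum(f[n] ** 2 for f in members) / M for n in range(N)]
    return mean, mean_sq

def similarity(p, mean, mean_sq):
    """Mean-square weighted distance between p and the set, from its statistics."""
    sig = [math.sqrt(q - m * m) for m, q in zip(mean, mean_sq)]
    geo = math.prod(sig) ** (1.0 / len(sig))
    return sum((geo / sig[n]) ** 2 * (p[n] ** 2 - 2 * p[n] * mean[n] + mean_sq[n])
               for n in range(len(p)))

def is_member(p, mean, mean_sq, threshold):
    """The event is called a member when S is less than the threshold T."""
    return similarity(p, mean, mean_sq) < threshold

members = [[0.0, 0.0], [4.0, 1.0], [8.0, 0.0], [4.0, -1.0]]
mean, mean_sq = class_statistics(members)
T = 6.0
inside = is_member([6.0, 0.0], mean, mean_sq, T)    # along the long axis of the ellipse
outside = is_member([4.0, 2.0], mean, mean_sq, T)   # same distance from the center, short axis
```

For this set S reduces to 0.25 (p_1 - 4)^2 + 4 p_2^2 + 4, so the two test points, although equally far from the center in the Euclidean sense, fall on opposite sides of the threshold.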

Unsupervised learning is graphically illustrated in Figure 7. The
two-dimensional ellipse drawn with a solid line signifies the domain D_1 of
the signal space in which any point yields S < T. This domain was obtained
during supervised activity. If a point P_1 is introduced after supervised
learning, so that P_1 lies outside D_1, then P_1 is merely rejected as a
nonmember of the class. If point P_2 contained in D_1 is introduced, however,
it is judged a member of the class and is included in the set of examples
to generate a new function S and a new domain D_2, designated by the dotted
line in Figure 7. A third point P_3, which was a nonmember before the
introduction of P_2, becomes recognized as a member of the class after the
inclusion of P_2 in the set of similar events.
Figure 7. Unsupervised Learning

Although the tendency of this process of "learning" is to perpetuate
the original domain, it has interesting properties worth investigating. The
investigation of unsupervised learning would form the basis for a valuable
continuation of the work presented herein.

Before leaving the subject of unsupervised learning, it should be
pointed out that as the new domain is formed, some points of the original
domain in Figure 7 become excluded from the class. Such an exclusion from the
class is analogous to "forgetting" because of lack of repetition. Forgetting
is the characteristic of not recognizing a point as a member of the class,
whereas at one time it was recognized to belong to it.
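
Both effects, the admission of a formerly rejected point and the "forgetting" of a formerly accepted one, can be reproduced with a small sketch. It is an illustration only: it assumes a single class, an axis-aligned metric, and a one-dimensional signal space chosen so the numbers are easy to check by hand; the class name, methods, and data are hypothetical.

```python
import math

class OneClassLearner:
    """Keeps running sums of one class and a fixed threshold T."""

    def __init__(self, members, threshold):
        n_dims = len(members[0])
        self.threshold = threshold
        self.count = 0
        self.sum = [0.0] * n_dims
        self.sum_sq = [0.0] * n_dims
        for f in members:
            self._include(f)

    def _include(self, event):
        self.count += 1
        for n, x in enumerate(event):
            self.sum[n] += x
            self.sum_sq[n] += x * x

    def similarity(self, p):
        mean = [s / self.count for s in self.sum]
        mean_sq = [s / self.count for s in self.sum_sq]
        sig = [math.sqrt(max(q - m * m, 0.0)) for m, q in zip(mean, mean_sq)]
        geo = math.prod(sig) ** (1.0 / len(sig))
        return sum((geo / sig[n]) ** 2 * (p[n] ** 2 - 2 * p[n] * mean[n] + mean_sq[n])
                   for n in range(len(p)))

    def would_accept(self, p):
        """Recognition only: is S less than T with the present statistics?"""
        return self.similarity(p) < self.threshold

    def present(self, p):
        """Unsupervised step: accepted events are included in the set."""
        accepted = self.would_accept(p)
        if accepted:
            self._include(p)
        return accepted

learner = OneClassLearner(members=[[-1.0], [1.0]], threshold=5.0)
before_far = learner.would_accept([2.2])     # just outside the original domain
before_old = learner.would_accept([-1.8])    # inside the original domain
accepted = learner.present([1.9])            # inside, so it is included
after_far = learner.would_accept([2.2])      # now inside the shifted domain
after_old = learner.would_accept([-1.8])     # now excluded ("forgotten")
```

The original domain is |p| < 2; after the point 1.9 is absorbed the domain shifts toward it, admitting 2.2 and dropping -1.8.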

3.3  Threshold  Setting 

In the classification of an event P the mean-square distance between
P and members of each of the categories is computed. The distance between
P and members of a category C is what we called "similarity", S_C(P), where
the "sense" in which "distance" is understood depends on the particular
category in question. We then stated that, in a manner analogous to decisions
based on maximum likelihood ratios, the point P is classified as a member of
the category to which it is most similar. Hence, P belongs to category C if
S_C(P) < S_X(P), where X is any of the other categories.

Since in this special theory the function S_C(P), which measures
membership in category C, was developed by maximally clustering points of C
without separating them from points of other sets, there is no guarantee,
in general, that a point of another set B may not be closer to C than to B.
This is guaranteed only if points of the sets satisfy certain conditions which
will be stated below. A graphical illustration which clarifies the
comparison of similarities of a point to the different categories is shown in
Figure 8. In this figure the elliptical contours S_{A_1}(P), S_{A_2}(P), etc.,
indicate the loci of points P in the signal space which are at a mean-square
distance of 1, 2, ..., etc., from members of category A. The loci of these
points are concentric ellipsoids in the N-dimensional signal space, shown
here in only two dimensions. Similarly, S_{B_1}(P), S_{B_2}(P), ..., etc., and
S_{C_1}(P), S_{C_2}(P), ..., etc., are the loci of those points whose mean-square
distance from categories B and C, respectively, are 1, 2, ..., etc. Note
carefully that the sense in which distance is measured to each of the
categories differs, as is indicated by the different orientations and
eccentricities of the ellipses.
Figure 8. Categorization

The heavy line shows the loci of points which are at equal mean-square
distances to two or more sets according to the manner in which distance is
measured to each set. This line, therefore, defines the boundary of each of
the categories.

At  this  point  in  the  discussion  it  would  be  helpful  to  digress  from 
the  subject  of  thresholds  and  dispel  some  misconceptions  which  Figure  8 
might  create  regarding  the  general  nature  of  the  categories  found  with  the 
method  described  herein.  It  will  be  recalled  that  one  of  the  possible 
ways in which a point not belonging to either category could be so classified
was by allowing a separate category for "everything else" and assigning the
point to the category to which its mean-square distance is smallest. Another,
perhaps more practical, method is to call a point a member of neither category
if its mean-square distance to the set of points of any class exceeds some
threshold value. If this threshold value is set, for example, at a mean-
square distance of 3 for all of the categories in Figure 8, then points
belonging to A, B, and C will lie inside the three ellipses shown in Figure 9.
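
The rule with a rejection threshold can be sketched as follows. The sketch assumes that each category is summarized by hypothetical per-coordinate statistics (mean, mean square, and weights) and that the same threshold is used for every category; it is an illustration, not the author's instrumentation.

```python
def similarity(p, category):
    """Mean-square weighted distance from p to a category given its statistics."""
    return sum(
        category["w"][n] ** 2
        * (p[n] ** 2 - 2 * p[n] * category["mean"][n] + category["mean_sq"][n])
        for n in range(len(p))
    )

def classify_with_rejection(p, categories, threshold):
    """Return the most similar category, or None if every distance exceeds the threshold."""
    scores = {name: similarity(p, cat) for name, cat in categories.items()}
    best = min(scores, key=scores.get)
    return best if scores[best] <= threshold else None

# Hypothetical statistics of two categories with unit variance in each coordinate.
categories = {
    "A": {"mean": [0.0, 0.0], "mean_sq": [1.0, 1.0], "w": [1.0, 1.0]},
    "B": {"mean": [10.0, 0.0], "mean_sq": [101.0, 1.0], "w": [1.0, 1.0]},
}
label_near_a = classify_with_rejection([0.5, 0.5], categories, threshold=3.0)
label_between = classify_with_rejection([5.0, 0.0], categories, threshold=3.0)
label_near_b = classify_with_rejection([9.5, 0.0], categories, threshold=3.0)
```

A point midway between the two sets is assigned to neither, without any samples of "everything else" having been supplied.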


Figure 9. Categorization with Threshold

It is readily seen, of course, that there is no particular reason why
one given minimum mean-square distance should be selected instead of another;
or, for that matter, that this minimum distance be the same for all
categories. Many logical and useful criteria may be selected for determining
the optimum threshold setting. Here, only one criterion will be singled out as
particularly useful. This criterion requires that the minimum thresholds
be set so that most of the labeled points fall into the correct category.
This is a fundamental criterion, for it requires the system to be designed to
work best by the largest number of correct decisions.

The criterion of selecting a threshold to make the most correct
classifications may be applied to our earlier discussion where the boundary
between categories was determined by equating the similarities of a point
to two or more categories. In the particular example of Figure 6, where a
point could be a member of only one of two categories A and B, the difference
S_A - S_B = 0 formed the dividing line. There is nothing magical about the
threshold zero; one might require that the dividing line between the two
categories be S_A - S_B = K, where K is a constant chosen from other
considerations. A similar problem in communication theory is the choice of a
signal-to-noise ratio which serves as the dividing line between calling the
received waveform "signal" or calling it "noise". It is understood, of
course, that signal-to-noise ratio is an appropriate criterion on which to
base decisions (at least in some cases), but the particular value of the ratio
to be used as a threshold level must be determined from additional
requirements. In communication theory these are usually requirements on the
false alarm or false dismissal rates. In the problem of choosing the constant
K, we may require that it be selected so that most of the labeled points lie
in the correct category.
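
One simple way to pick K by this criterion is to try a candidate between every pair of adjacent observed differences and keep the one that classifies the most labeled points correctly. The sketch below assumes the decision rule "A if S_A - S_B < K, otherwise B" and hypothetical difference values; ties are resolved in favor of the smallest candidate.

```python
def count_correct(k, diffs_and_labels):
    """Number of labeled points classified correctly by the rule: A if diff < k, else B."""
    return sum((label == "A") == (d < k) for d, label in diffs_and_labels)

def best_threshold(diffs_and_labels):
    """Scan candidate thresholds between sorted differences; return the best one."""
    diffs = sorted(d for d, _ in diffs_and_labels)
    candidates = (
        [diffs[0] - 1.0]
        + [(a + b) / 2.0 for a, b in zip(diffs, diffs[1:])]
        + [diffs[-1] + 1.0]
    )
    return max(candidates, key=lambda k: count_correct(k, diffs_and_labels))

# Hypothetical values of S_A - S_B for six labeled events.
data = [(-3.0, "A"), (-1.0, "A"), (0.5, "A"), (1.5, "B"), (2.0, "B"), (4.0, "B")]
k_best = best_threshold(data)
correct_at_zero = count_correct(0.0, data)
correct_at_best = count_correct(k_best, data)
```

Here the threshold zero misclassifies one event of A, while K = 1.0 classifies all six labeled events correctly.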

3.4  Practical  Considerations 

In considering the instrumentation of the process of categorization
previously described, two main objectives of the machine design must receive
careful consideration. The first of these is the practical requirement that
all computations involved in either the learning or the recognition mode of
the machine's operation be performed as rapidly as possible. It is
especially desirable that the classification or recognition of a new event be
implemented in essentially real time. The importance of this requirement
is readily appreciated if the classificatory technique is considered in terms
of an application such as the automatic recognition of speech events, an
important part of voice controlled phonetic typewriters. The second major
objective, not unrelated to the first, is that the storage capacity required
of the machine have an upper bound, thus assuring that the machine is of
finite and predetermined size. At first glance it seems that the
instrumentation of the machine of Figure 6 requires a storage capacity
proportional to the number of events encountered during the machine's
experience. This seems so because the set of labeled events on which the
computations are carried out must be stored in the machine. It will be shown
in this section, however, that all computations may be performed from
knowledge of only certain statistics of the set of labeled events, and that
these statistics may be recomputed to include a new event without knowledge
of the original set. Therefore, it is necessary to store only these
statistics, the number of which is independent of the number of points in
the set.

It will be recalled that there are two instances where knowledge of
the data matrix is necessary. The data matrix [F], given in Equation 3.1,
is the M x N matrix of coefficients which results when the M given examples
of the same category are represented as N-dimensional vectors.

    [F] = \begin{bmatrix}
            f_{11} & f_{12} & \cdots & f_{1N} \\
            f_{21} & f_{22} & \cdots & f_{2N} \\
            \vdots & \vdots &        & \vdots \\
            f_{M1} & f_{M2} & \cdots & f_{MN}
          \end{bmatrix}                                            (3.1)

The first use of this matrix occurs in the computation of the
optimum orthogonal transformation or metric which minimizes the mean-square
distance of the set of like events. This transformation is stated in the
Theorem in Section 2.4 and is given in Equation 2.37 as the product of an
orthonormal and diagonal transformation. Rows of the orthonormal
transformation [C] are eigenvectors of the covariance matrix [U] computed from
the data matrix of Equation 3.1, and elements of the diagonal matrix [W] are
the reciprocal standard deviations of the data matrix after it has been
transformed by the orthonormal transformation.

The second use of the matrix occurs when an unclassified event P
is compared to the set by measuring the mean-square distance between P and
points of the set after both the point and the set have been transformed.
This latter comparison is replaced by the measurement of the distance between
the transformed point P and a "typical example" of the set, as stated by
Equation 2.19. The quantities of interest in this computation, as seen
from Equation 2.21, are the mean, the mean-square, and the standard deviation
of the elements in the columns of the data matrix after the orthonormal
transformation.

Reduction of the necessary storage facility of the machine may be
accomplished if only the covariance matrix, the means, the mean-squares, and
the standard deviations of the transformed data matrix are used in the
computations, and if these may be recomputed without reference to the original
data matrix. The expression of the above quantities when based on M+1 events
may be computed from the corresponding quantity based on M events and a
complete knowledge of the (M+1)st event itself. The method of the
computations is described below.

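(1) The covariance matrix of M+1 events.

The general coefficient of the covariance matrix [U] of the set of
events given by the data matrix is given in Equation 3.2.

    u_{ns} = u_{sn} = \overline{f_n f_s} - \bar{f}_n \bar{f}_s          (3.2)

Note, incidentally, that the matrix [U] may be written as in Equation 3.3,
where the matrix [J] has been introduced for convenience; [J] is the M x N
matrix in which every element of the nth column equals the mean \bar{f}_n.

    [U] = \frac{1}{M} [F - J]^T [F - J]                               (3.3a)

As a check, let us compute the general element u_{ns}. The product is the
covariance matrix coefficient u_{ns}.

    u_{ns} = \frac{1}{M}
             \begin{bmatrix} (f_{1n} - \bar{f}_n) & (f_{2n} - \bar{f}_n) & \cdots & (f_{Mn} - \bar{f}_n) \end{bmatrix}
             \begin{bmatrix} f_{1s} - \bar{f}_s \\ f_{2s} - \bar{f}_s \\ \vdots \\ f_{Ms} - \bar{f}_s \end{bmatrix}        (3.4)

           = \frac{1}{M} \sum_{m=1}^{M} (f_{mn} - \bar{f}_n)(f_{ms} - \bar{f}_s)
           = \overline{f_n f_s} - \bar{f}_n \bar{f}_s                  (3.5)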


Now to compute the covariance based on M+1 events, u_{ns}(M+1),
it is convenient to store the N means \bar{f}_n for all values of n. It is
also convenient to store the N(N+1)/2 independent values of
\overline{f_n f_s}. Both of these quantities may be updated readily as a new
event is introduced. The mean \bar{f}_n^{M+1} based on M+1 events may be
obtained from the mean based on only M events, \bar{f}_n^{M}, from Equation
3.6a, and \overline{f_n f_s}^{M+1} may be obtained from Equation 3.6b.

    \bar{f}_n^{M+1} = \frac{M \bar{f}_n^{M} + f_{M+1,n}}{M+1}          (3.6a)

    \overline{f_n f_s}^{M+1}
        = \frac{M \overline{f_n f_s}^{M} + f_{M+1,n} f_{M+1,s}}{M+1}   (3.6b)

Here, the superscript of the ensemble average indicates the number of events
partaking in the averaging, and f_{M+1,n} is the nth coefficient of the
(M+1)st event. We now have everything necessary to compute the new covariance
coefficients. The storage facility required thus far is N(N+3)/2 + 1
locations. The +1 is used for storing the number M. If the covariance matrix
is also stored, the necessary number of storage locations is (N+1)^2; this
makes use of the fact that both [\overline{f_n f_s}] and [U] are symmetric
matrices.

From the matrix [U] the orthonormal transformation [C] may be found
by solving the eigenvalue problem |[U] - \lambda [I]| = 0. The matrix [C] has
to be stored, requiring an additional N^2 storage locations.
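The recursive update of Equations 3.6a and 3.6b can be sketched as follows. This is a minimal illustration, not the instrumentation described in the report: the function names are hypothetical, and for simplicity the full N x N array of averaged products is kept rather than only its N(N+1)/2 independent values.

```python
def update_statistics(m, mean, second, event):
    """Fold one new event into the stored averages (Equations 3.6a and 3.6b).

    m      -- number of events already included
    mean   -- list of the N column means
    second -- N x N list of averaged products f_n * f_s
    event  -- the N coefficients of the new event
    """
    n = len(event)
    new_mean = [(m * mean[i] + event[i]) / (m + 1) for i in range(n)]
    new_second = [
        [(m * second[i][j] + event[i] * event[j]) / (m + 1) for j in range(n)]
        for i in range(n)
    ]
    return m + 1, new_mean, new_second


def covariance(mean, second):
    """Covariance coefficients u_ns from the stored averages (Equation 3.2)."""
    n = len(mean)
    return [[second[i][j] - mean[i] * mean[j] for j in range(n)] for i in range(n)]


# Three hypothetical two-dimensional events, processed one at a time.
events = [[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]]
m, mean, second = 0, [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]]
for e in events:
    m, mean, second = update_statistics(m, mean, second, e)
u = covariance(mean, second)
```

After the loop the original events are no longer needed; the covariance is recovered from the stored averages alone.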

(2) Mean of the pth column of [F'].

As stated earlier, one of the quantities of interest in the typifying
example is the mean of the elements in a column of the data matrix after its
orthonormal transformation with [C]. The general element of the [F']
matrix is given in Equation 2.26b and in 3.7a, and its mean is given
in Equation 3.7b.

    f'_{mp} = \sum_{n=1}^{N} f_{mn} c_{pn}                            (3.7a)

    \bar{f'}_p = \sum_{n=1}^{N} \bar{f}_n c_{pn}                      (3.7b)

No additional storage is required to compute \bar{f'}_p, since all the factors
of Equation 3.7b are already known. An additional N locations must be made
available to store the N means, however.

(3) Mean-square of the pth column of [F'].

The mean-square value of the elements of the pth column of [F']
is given in Equations 3.8a and 3.8b.

    \overline{f'^2_p} = \frac{1}{M} \sum_{m=1}^{M} f'^2_{mp}
        = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N} \sum_{s=1}^{N}
          f_{mn} f_{ms} c_{pn} c_{ps}                                 (3.8a)

    \overline{f'^2_p} = \sum_{n=1}^{N} \sum_{s=1}^{N}
          \overline{f_n f_s} \, c_{pn} c_{ps}                         (3.8b)

No additional storage is necessary for this computation. An additional
N locations, however, must be available to store \overline{f'^2_p}.

(4) Standard deviation of the pth column of [F'].

The only remaining quantity necessary in the instrumentation of the
recognition system is the reciprocal standard deviation of the pth column
of [F'], as stated in the theorem of Chapter 2. The standard deviation and
the elements of the diagonal matrix [W] are given by Equation 3.9, where all
the quantities are already known. An additional N locations are needed to
store their values, however.

    w_{pp} = \frac{1}{\sigma_p}
           = \left( \overline{f'^2_p} - \bar{f'}_p^{\,2} \right)^{-1/2}   (3.9)
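Items (2) through (4) use only the stored averages and the orthonormal transformation. The sketch below is one way to carry them out, under the assumptions that row p of the hypothetical argument `c` holds the coefficients c_pn and that no transformed column has zero variance; the 45 degree rotation and the numbers are illustrative only.

```python
import math


def transformed_statistics(mean, second, c):
    """Column statistics of the transformed data matrix from stored averages.

    Returns, for each transformed column p, its mean (Equation 3.7b), its
    mean-square (Equation 3.8b) and its reciprocal standard deviation
    (Equation 3.9). The original events are never consulted.
    """
    n = len(mean)
    means, mean_squares, weights = [], [], []
    for p in range(n):
        m_p = sum(mean[i] * c[p][i] for i in range(n))
        ms_p = sum(
            second[i][j] * c[p][i] * c[p][j] for i in range(n) for j in range(n)
        )
        sigma_p = math.sqrt(ms_p - m_p * m_p)  # assumed nonzero in this sketch
        means.append(m_p)
        mean_squares.append(ms_p)
        weights.append(1.0 / sigma_p)
    return means, mean_squares, weights


# Hypothetical stored averages and a 45 degree rotation as the transformation.
s = 1.0 / math.sqrt(2.0)
c = [[s, s], [-s, s]]
mean = [3.0, 3.0]
second = [[35.0 / 3.0, 25.0 / 3.0], [25.0 / 3.0, 41.0 / 3.0]]
means, mean_squares, weights = transformed_statistics(mean, second, c)
```

A zero variance along some direction would make the corresponding weight infinite, which is the degenerate case discussed later in this section.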


The total number of storage locations is 2N^2 + 5N + 1 for each of the
categories to which events may belong. If the number of examples M of a
category is less than the number of dimensions N of the space in which they
are represented, the required number of storage locations is only
2M^2 + 5M + 1. In order to utilize this further reduction of storage and
computational time, however, the M events must be reexpressed in a new
coordinate system obtained through the Schmidt orthogonalization of the set of
M vectors representing the examples of the set. In the beginning of the
learning process, when the number of labeled events is very much smaller than
the number of dimensions of the space, the saving achieved by Schmidt
orthogonalization is very significant.

A practical remark worthy of mention is that at the beginning
of the learning process, when M is less than N, the solution of the eigenvalue
problem |[U] - \lambda [I]| = 0 may be greatly simplified by recognition of
the fact that [U] is singular if M < N. Although it is not immediately
obvious, nevertheless, it is true that the nonzero eigenvalues of [U] in
Equation 3.3a are identical to the eigenvalues of the matrix
[F - J][F - J]^T, as stated below.

    Nonzero eigenvalues of [F - J]^T [F - J]
        = eigenvalues of [F - J][F - J]^T                             (3.10)

The first of the matrices is an N x N, while the second is an M x M matrix.
There are N-M+1 zero eigenvalues of the first matrix; the computational
advantage of working with the second matrix for M < N is therefore significant.
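A short argument for Equation 3.10 may be sketched as follows. The symbols D, v and \lambda are introduced only for this illustration, with D standing for the M x N matrix [F - J].

```latex
\begin{aligned}
D D^{T} v &= \lambda v, \qquad \lambda \neq 0,\; v \neq 0 \\
\Rightarrow\quad D^{T} D \,(D^{T} v) &= D^{T} (D D^{T} v) = \lambda \,(D^{T} v), \\
D^{T} v &\neq 0 \quad \text{since} \quad D\,(D^{T} v) = \lambda v \neq 0 .
\end{aligned}
```

Every nonzero eigenvalue of the M x M matrix D D^T is therefore also an eigenvalue of the N x N matrix D^T D, and the same argument with D and D^T interchanged gives the converse.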

A few additional remarks should be made about the nature of the solution
obtained with the two constraints of Equations 2.12b and 2.13. It should be
noted, first of all, that if the number of points in a set is equal to or less
than the number of dimensions in which they are expressed, then a hyperplane
of one less dimension can always be passed through the points. Along any
direction orthogonal to this hyperplane, the projections of points of the set
F are equal. Along such a direction, therefore, the variance of the given
points is zero, leading to a zero eigenvalue of the covariance matrix. This
results in calling the corresponding eigenvector (the direction about which
the variance is zero) an "all important" feature. The feature weighting
coefficient is thus unity or infinity, depending on which of the above two
constraints was applied. If the second or constant volume constraint were
used, each point of the set F used in learning would be correctly identified,
and its distance to the set F would be zero by the optimum metric. At the same
time the metric classifies each point of another category G as a nonmember of
F. A new member of category F, on the other hand, would probably be
misclassified, since it is unlikely that the new member of F would have
exactly the same projection along the eigenvector as the other members had
displayed. This misclassification would not occur if the number of examples of
the category F exceeded the number of dimensions in which they are expressed.
There are several methods to prevent misclassification; for example, if the
first constraint were applied, misclassification of members of F would not
occur.
Another fact of some importance which should be brought to the
reader's attention is the physical significance of the eigenvectors. The
vector with the smallest eigenvalue or largest feature weighting coefficient
designates that feature of members of the set in which the members are most
similar. This is not equivalent to the feature which is most similar to
members of the set. The former is a solution of a problem in which we wish
to find a direction along which the projections of the set, on the average,
are most nearly the same. The second is a solution of a problem where we wish
to find the direction along which the projections of the set are largest, on
the average. The desired direction, in the first case, is the eigenvector
of the covariance matrix with the smallest eigenvalue; in the second
case, it is the eigenvector of the correlation matrix with the largest
eigenvalue. It can be shown that the latter problem is equivalent to finding
the set of orthonormal functions in which a process is to be expanded so
that the truncation error, which results when only a finite number of terms
of the expansion are retained, should be minimized, on the average. The set
of functions having this property are eigenfunctions of the correlation
function of the process, and they are arranged in the order of decreasing
eigenvalues.

The important concepts of this chapter will now be summarized.
Pattern recognition consists of the twofold task of "learning", on the one
hand, what the category is to which a set of events belongs; and of deciding,
on the other hand, whether a new event belongs to the category or not.
"Learning", for the simple situation where similarity to a class of things
is determined solely from examples of the class, may be instrumented in the
form of the diagram of Figure 6. In this diagram "learning" consists of the
construction of metrics or the development of linear transformations which
maximize the clustering of points which represent similar events. A
distinction is made between "supervised learning" (learning on known examples
of the class) and "unsupervised learning" (learning through use of the
machine's own experience). In this connection it is stated that the
convergence of a learning process to correct category recognition, in most
cases, probably cannot be guaranteed. The problem of threshold setting for
partitioning the signal space is likened to the similar problem in the
detection of noisy signals, and may be solved as an extremum problem. Finally,
some practical considerations of importance in the mechanization of the
decision process are discussed. It is shown that only finite storage capacity
is required of the machine which instruments the techniques, and that the
amount of storage has an upper bound which depends on the number of dimensions
of the signal space.

4. DECISION THEORETICAL BASIS OF CATEGORIZATION TECHNIQUES

The categorization techniques outlined in the preceding sections
do not involve assumptions about the probability distributions of the
respective categories. It is instructive, however, to consider the relation
of these techniques to the conventional decision-theoretical approach to
problems of categorization. In the latter it is assumed that probability
density functions for each category are known. Given such functions, it
is possible to set up optimum procedures for categorization. It will be
shown that under certain conditions the criteria developed in this report
are exactly those prescribed by decision theory when the distributions are
known. The important fact that should be kept in mind is that the
categorization techniques discussed in earlier sections do not require
knowledge of the density functions. They provide procedures of categorization
where there is no knowledge of such distributions.

The purpose of this section is to provide a corroboration of the
techniques and to lay bare their relation to decision theory proper.
That such a corroboration should occur so fortuitously after the development
of the techniques is gratifying in that it gives support from a
well-established mathematical theory.

In order to set down the relation between the categorization
techniques of earlier chapters and decision theory, it will be necessary to
state briefly some of the assumptions and results of the latter. These
results will be stated without proof since their full exposition may be found
in any text on decision theory.

For purposes of exposition, two categories will be treated.
Generalization to K categories proceeds in a natural way, but tends to
obscure the essential simplicity of the theory. Assume, then, that there
are two categories, C_1 and C_2, to which it is desired to assign objects as
yet uncategorized. The only direct knowledge available about a specific object
is a set of n measurements made upon it. Furthermore, a probability density
function for each category is known such that, when integrated over a
region A of the n-dimensional space spanned by the n measurements, it
yields the probability that an object from a given category will produce
measurements falling in region A. That is, the probability that an object
from category C_i is accompanied by n measurements that fall in A is given by

    \int_A p_i(x) \, dx

where p_i(x) is the probability density function for category C_i and x
represents the vector (x_1, x_2, ..., x_n).

Let it also be assumed that a priori probabilities \pi_1 and \pi_2
are known which give the probability of occurrence of an object from
C_1 and C_2, respectively. The decision theory approach involves dividing
the n-dimensional space into two regions, R_1 and R_2, such that when a set
of measurements falls in R_1 the object is assigned to C_1 and, similarly,
when the measurements fall in R_2, the object is assigned to C_2.

If the a priori probabilities and the density functions are known,
then these regions may be chosen in such a way that the expected cost of
making decisions is minimized. Here, it is assumed that there is a cost
connected with making a misclassification. A division of n-dimensional space
into two regions R_1 and R_2 is called a decision procedure. The expected
cost may be written

    E(K) = \pi_1 K_{21} \int_{R_2} p_1(x) \, dx
         + \pi_2 K_{12} \int_{R_1} p_2(x) \, dx,                      (4.1)

where K_{21} is the cost of misclassifying an object from C_1 and K_{12}, that
of misclassifying an object from C_2. The first term of Equation (4.1) is the
expected cost due to misclassifying objects from C_1. Since p_1(x) is the
density function for C_1, its integration over R_2 (the region where the
procedure specifies that the object be assigned to C_2), weighted by
\pi_1 K_{21}, gives this expected cost. A similar statement may be made for the
second term of Equation (4.1). Hence (4.1) gives the total expected cost.

It is desired to choose the regions R_1 and R_2 that minimize Equation
(4.1). To determine these regions, we rewrite Equation (4.1) in the
following manner.

    E(K) = \int_{R_2} \left[ \pi_1 K_{21} p_1(x) - \pi_2 K_{12} p_2(x) \right] dx
         + \pi_2 K_{12}                                               (4.2)

The last term of Equation (4.2) is a positive number. Consequently,
Equation (4.2) is made smaller by choosing the region R_2 so that it contains
all (and only) those points x such that

    \pi_1 K_{21} p_1(x) - \pi_2 K_{12} p_2(x) \le 0.                  (4.3)

Thus, R_1 must be the region of points which satisfy

    \pi_1 K_{21} p_1(x) - \pi_2 K_{12} p_2(x) > 0.                    (4.4)

Another way of writing Equations (4.3) and (4.4) is

    R_1:  \frac{p_1(x)}{p_2(x)} > \frac{\pi_2 K_{12}}{\pi_1 K_{21}}     (4.5)

    R_2:  \frac{p_1(x)}{p_2(x)} \le \frac{\pi_2 K_{12}}{\pi_1 K_{21}}   (4.6)

The optimum decision procedure when a priori probabilities,
\pi_1 and \pi_2, and density functions are known is given by Equations (4.5)
and (4.6). That is, given a set of measurements x, the object represented by x
is assigned to C_1 or C_2 depending on whether x satisfies Inequality (4.5) or
(4.6).
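The rule expressed by Inequalities (4.5) and (4.6) may be sketched in a few lines. The example assumes one-dimensional normal densities purely for illustration; the function names, the numerical values, and the convention that a tie goes to C_2 are assumptions of the sketch. The comparison is written without a division so that a vanishing p_2(x) causes no difficulty.

```python
import math


def normal_density(x, mean, std):
    """One-dimensional normal probability density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def assign(x, p1, p2, prior1, prior2, k21, k12):
    """Return 1 or 2 by comparing the likelihood ratio with its threshold.

    k21 is the cost of misclassifying an object from C1,
    k12 the cost of misclassifying an object from C2.
    """
    threshold = (prior2 * k12) / (prior1 * k21)
    # p1/p2 > threshold, rearranged to avoid dividing by p2(x).
    return 1 if p1(x) > threshold * p2(x) else 2


def p1(x):
    return normal_density(x, 0.0, 1.0)  # hypothetical density of C1


def p2(x):
    return normal_density(x, 2.0, 1.0)  # hypothetical density of C2


equal_case = assign(0.5, p1, p2, 0.5, 0.5, 1.0, 1.0)
skewed_case = assign(0.5, p1, p2, 0.1, 0.9, 1.0, 1.0)
```

With equal a priori probabilities and costs the dividing point lies midway between the two means; making C_2 nine times more probable moves it toward the mean of C_1, so the same measurement is assigned differently in the two calls.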

Unfortunately, the a priori probability of occurrence of an object
from a specified set is seldom known. In lieu of these probabilities there
are procedures which permit determination of the regions R_1 and R_2. Thus,
one might assume the a priori probabilities to be equal. This is known
as a maximum likelihood criterion. Another criterion is to minimize the
maximum probability of misclassification. This is the "minimax" criterion.
It is obtained by choosing the regions in such a manner that the expected
cost of misclassifying an object from C_1 is equal to that of misclassifying
an object from C_2, i.e.,

    K_{12} \int_{R_1} p_2(x) \, dx = K_{21} \int_{R_2} p_1(x) \, dx

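We next consider x distributed normally with mean \mu_1 and
covariance U_1 when it is a member of C_1, and mean \mu_2 and covariance U_2
when it is a member of C_2. That is, p_1(x) is given by

    p_1(x) = \frac{1}{(2\pi)^{n/2} |U_1|^{1/2}}
             \exp\left[ -\tfrac{1}{2} (x - \mu_1) U_1^{-1} (x - \mu_1)' \right]   (4.7)

and p_2(x), by

    p_2(x) = \frac{1}{(2\pi)^{n/2} |U_2|^{1/2}}
             \exp\left[ -\tfrac{1}{2} (x - \mu_2) U_2^{-1} (x - \mu_2)' \right]   (4.8)

The regions R_1 and R_2 as given by Equations (4.5) and (4.6) are

    R_1:  \frac{p_1(x)}{p_2(x)} = \frac{|U_2|^{1/2}}{|U_1|^{1/2}}
          \exp\left[ -\tfrac{1}{2} (x - \mu_1) U_1^{-1} (x - \mu_1)'
                     + \tfrac{1}{2} (x - \mu_2) U_2^{-1} (x - \mu_2)' \right]
          > \frac{\pi_2 K_{12}}{\pi_1 K_{21}}

    R_2:  \frac{p_1(x)}{p_2(x)} = \frac{|U_2|^{1/2}}{|U_1|^{1/2}}
          \exp\left[ -\tfrac{1}{2} (x - \mu_1) U_1^{-1} (x - \mu_1)'
                     + \tfrac{1}{2} (x - \mu_2) U_2^{-1} (x - \mu_2)' \right]
          \le \frac{\pi_2 K_{12}}{\pi_1 K_{21}}

Since the logarithmic function is monotonically increasing, the ratio
p_1(x)/p_2(x) may be replaced by its logarithm, i.e.,

    R_1:  \log \frac{|U_2|^{1/2}}{|U_1|^{1/2}}
          - \tfrac{1}{2} \left[ (x - \mu_1) U_1^{-1} (x - \mu_1)'
                              - (x - \mu_2) U_2^{-1} (x - \mu_2)' \right]
          > \log \frac{\pi_2 K_{12}}{\pi_1 K_{21}}                     (4.9)

    R_2:  \log \frac{|U_2|^{1/2}}{|U_1|^{1/2}}
          - \tfrac{1}{2} \left[ (x - \mu_1) U_1^{-1} (x - \mu_1)'
                              - (x - \mu_2) U_2^{-1} (x - \mu_2)' \right]
          \le \log \frac{\pi_2 K_{12}}{\pi_1 K_{21}}                   (4.10)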

It will now be shown that the regions expressed in inequalities (4.9) and
(4.10) are the same as those developed in the preceding sections for the
categorization of unlabeled objects. First, let it be noted that

    U_1 = A_1' \Lambda_1 A_1

where A_1 is the matrix the rows of which are the eigenvectors of U_1 and
\Lambda_1 is the diagonal matrix of eigenvalues. Then

    A_1^{-1} = A_1'

and

    U_1^{-1} = A_1' \Lambda_1^{-1} A_1.

Likewise, when A_2 and \Lambda_2 are the matrices of eigenvectors and
eigenvalues of U_2,

    U_2^{-1} = A_2' \Lambda_2^{-1} A_2.

Hence,

    -\tfrac{1}{2} \left[ (x - \mu_1) U_1^{-1} (x - \mu_1)'
                       - (x - \mu_2) U_2^{-1} (x - \mu_2)' \right]

      = -\tfrac{1}{2} \left[ (x - \mu_1) A_1' \Lambda_1^{-1} A_1 (x - \mu_1)'
                           - (x - \mu_2) A_2' \Lambda_2^{-1} A_2 (x - \mu_2)' \right]

      = -\tfrac{1}{2} \left[ (x A_1' - \mu_1 A_1') \Lambda_1^{-1} (x A_1' - \mu_1 A_1')'
                           - (x A_2' - \mu_2 A_2') \Lambda_2^{-1} (x A_2' - \mu_2 A_2')' \right]   (4.11)

It has been shown in the preceding sections that the transformation
which minimizes the mean-square distance of the first category when volume is
held invariant is given by

    y = x A_1' \Lambda_1^{-1/2}.                                      (4.12)

Similarly, the transformation that minimizes the mean-square distance
of the second category is

    y = x A_2' \Lambda_2^{-1/2}.                                      (4.13)

The function, as prescribed by the techniques mentioned in this report, that
measures the similarity of a point x^*, as yet uncategorized, to the category
C_1 is the mean-square distance, after transformation, of the unlabeled point
to the points of category C_1. This mean-square distance may be written

    S_1(x^*) = \overline{ \sum_{j=1}^{n} \frac{1}{\lambda_j}
               \left( x^* a_j' - x a_j' \right)^2 }                   (4.14)

where the a_j's are the eigenvectors of U_1 and the \lambda_j's are the
eigenvalues of U_1.

Likewise the mean-square distance of x^* to the points of C_2 is
given by

    S_2(x^*) = \overline{ \sum_{j=1}^{n} \frac{1}{\nu_j}
               \left( x^* b_j' - x b_j' \right)^2 }                   (4.15)

where the b_j's are the eigenvectors and the \nu_j's are the eigenvalues
of U_2.

The decision procedure whereby the point x^* is assigned to C_1 or
C_2 consists of observing whether

    S_1(x^*) - S_2(x^*) < K                                           (4.16)

or

    S_1(x^*) - S_2(x^*) > K,                                          (4.17)

where K is a number chosen to satisfy some criterion (e.g., minimization
of the false dismissal rate).

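It will be shown that the regions defined by (4.16) and (4.17) are
the same, except for additive constants, as those defined by (4.9) and (4.10).
We first observe that the mean-square distance S_1(x^*) of x^* to the points
of C_1 is

    S_1(x^*) = \sum_{j=1}^{n} \frac{1}{\lambda_j}
               \overline{ \left( x^* a_j' - x a_j' \right)^2 }

             = \sum_{j=1}^{n} \frac{1}{\lambda_j}
               \left[ (x^* a_j')^2 - 2 \, x^* a_j' \, \overline{x a_j'}
                      + \overline{(x a_j')^2} \right]                 (4.18)

where the averaging takes place over the category C_1.

Next, observe that the variance of the rotated coordinate x a_j'
is given by the eigenvalue \lambda_j. That is,

    \overline{(x a_j')^2} - \left( \overline{x a_j'} \right)^2 = \lambda_j     (4.19)

Adding and subtracting (\overline{x a_j'})^2 within the brackets of (4.18) and
employing (4.19), we obtain

    S_1(x^*) = n + \sum_{j=1}^{n} \frac{1}{\lambda_j}
               \left( x^* a_j' - \overline{x a_j'} \right)^2          (4.20)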


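Similarly, the mean-square distance of x^* to the points of C_2 may be written

    S_2(x^*) = n + \sum_{j=1}^{n} \frac{1}{\nu_j}
               \left( x^* b_j' - \overline{x b_j'} \right)^2          (4.21)

where the averaging takes place over the points of C_2, the b_j's are the
eigenvectors of U_2, and the \nu_j's are its eigenvalues.

If we denote the vector the components of which are the means
of the components of the category C_1 by \mu_1 and the corresponding vector
for C_2 by \mu_2, then the regions (4.16) and (4.17) may be rewritten as

    \sum_{j} \frac{1}{\lambda_j} (x^* a_j' - \mu_1 a_j')^2
      - \sum_{j} \frac{1}{\nu_j} (x^* b_j' - \mu_2 b_j')^2 < K        (4.16a)

    \sum_{j} \frac{1}{\lambda_j} (x^* a_j' - \mu_1 a_j')^2
      - \sum_{j} \frac{1}{\nu_j} (x^* b_j' - \mu_2 b_j')^2 > K        (4.17a)

We next observe that

    \sum_{j} \frac{1}{\lambda_j} (x^* a_j' - \mu_1 a_j')^2
      = (x^* A_1' - \mu_1 A_1') \Lambda_1^{-1} (x^* A_1' - \mu_1 A_1')'
      = (x^* - \mu_1) A_1' \Lambda_1^{-1} A_1 (x^* - \mu_1)'          (4.22)

and

    \sum_{j} \frac{1}{\nu_j} (x^* b_j' - \mu_2 b_j')^2
      = (x^* - \mu_2) A_2' \Lambda_2^{-1} A_2 (x^* - \mu_2)'.         (4.23)

It has been observed above that

    U_1^{-1} = A_1' \Lambda_1^{-1} A_1

and

    U_2^{-1} = A_2' \Lambda_2^{-1} A_2.

Hence, the regions (4.16a) and (4.17a) may be written

    (x^* - \mu_1) U_1^{-1} (x^* - \mu_1)' - (x^* - \mu_2) U_2^{-1} (x^* - \mu_2)' < K

    (x^* - \mu_1) U_1^{-1} (x^* - \mu_1)' - (x^* - \mu_2) U_2^{-1} (x^* - \mu_2)' > K.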
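The identity behind Equations (4.22) and (4.23), namely that a sum of squared projections on the eigenvectors, each divided by its eigenvalue, equals the quadratic form with the inverse covariance matrix, can be checked numerically. The sketch below does so for a symmetric 2 x 2 matrix with a nonzero off-diagonal element, using closed-form eigenvectors; the function names and the numbers are hypothetical.

```python
import math


def eigen_2x2(u):
    """Eigenvalues and unit eigenvectors of a symmetric 2 x 2 matrix.

    Assumes the off-diagonal element is nonzero.
    """
    a, b, d = u[0][0], u[0][1], u[1][1]
    half_trace = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    pairs = []
    for lam in (half_trace + radius, half_trace - radius):
        vx, vy = b, lam - a  # solves (U - lam I) v = 0 when b is nonzero
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs


def weighted_projection_distance(x, mu, u):
    """Sum over j of (x a_j' - mu a_j')**2 / lambda_j."""
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return sum((dx * v[0] + dy * v[1]) ** 2 / lam for lam, v in eigen_2x2(u))


def quadratic_form(x, mu, u):
    """(x - mu) U^{-1} (x - mu)' using the explicit 2 x 2 inverse."""
    a, b, d = u[0][0], u[0][1], u[1][1]
    det = a * d - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det


u1 = [[2.0, 1.0], [1.0, 3.0]]   # hypothetical covariance matrix
mu1 = (1.0, 1.0)                # hypothetical mean
x_star = (2.0, 0.0)             # hypothetical unlabeled point
lhs = weighted_projection_distance(x_star, mu1, u1)
rhs = quadratic_form(x_star, mu1, u1)
```

Both quantities agree to rounding error, which is the step that allows the similarity measure to be compared directly with the exponent of the normal density.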

These regions are the same as those of (4.9) and (4.10) when the constant
term K is chosen properly. Various choices of K reflect the criterion
employed in the decision procedure. Some of these have been mentioned above
(e.g., maximum likelihood, minimax).

The correspondence of the two decision procedures is quite informative.
In order to arrive at the decision theory solution, it was necessary to know
the density functions of the categories. Knowledge of the a priori
probabilities, although not indispensable, was an essential part of the
reasoning. The techniques presented in this contract allow a procedure when
neither of these factors is known.

5. CATEGORIZATION BY SEPARATION OF CLASSES
5.1 Optimization Criteria

The central concept of the special theory of similarity described
in the preceding chapters is that nonidentical events of a common category
may be considered close by some method of measuring distance. This measure
of distance is placed in evidence by that transformation of the signal space
which brings together like events by clustering them most. In this special
theory no serious attempt has been made to assure that the metrics which were
developed should separate events of different categories.

The purpose of this chapter is to introduce criteria for developing
optimum metrics and transformations which not only cluster events of the same
class but also separate those which belong to different classes. Consider,
for example, the transformation which maximizes the mean-square distance
between points which belong to different classes while it minimizes the
mean-square distance between points of the same class. The effect of such a
transformation is illustrated in Figure 10, where like events have been
clustered through minimization of intraset distances and clusters have been
separated from each other through the maximization of interset distances.
The transformation which accomplishes the stated objectives can be specified
by the following problems.

Problem 1

Find the transformation T within a specified class of transformations
which maximizes the mean-square interset distance subject to the constraint
that the sum of the mean-square interset and intraset distances is held
constant.


Figure 10. Separation of Classes


Note that for the sake of simplifying the mathematics, the
minimization of intraset distances was converted to a constraint on the
maximization problem. If interset distances are maximized, and the sum of
inter- and intraset distances is constant, then it follows that intraset
distances are minimized. We may impose the additional constraint that the
mean-square intraset distance of each class is equal, thereby avoiding the
possible preferential treatment of one class over another. Without the latter
constraint the situation indicated with dotted lines in Figure 10 may occur,
where minimization of the sum of intraset distances may leave one set more
clustered than the other.

The above criterion of optimization is given as an illustrative
example of how one may convert the desirable objective of separation of
classes to a mathematically expressible and solvable problem. Several
alternate ways of stating the desired objectives as well as choosing the
constraints are possible. For example, the mean-square intraset distance could
be minimized while holding the interset distances constant.

The optimization criterion just discussed suggests a different block
diagram for the process of categorization than that shown in Figure 6. Here
only a single transformation is developed, resulting in only a single metric
with which to measure distance to all of the classes. The classification of
an event P is accomplished, as before, by noting to which of the classes the
event is most similar. The only difference is that now similarity to each
class is measured in the same sense, in the sense exhibited by the
transformation which maximally separated events of different categories, on
the average.

PfoOlg*  2 

A second, even more interesting criterion for optimum categorization
is the optimization of the classificatory decision on the labeled events.
Classificatory decisions are ultimately based on comparing the similarity S
(mean-square distance) of the event P with the known events of each class.
If P is chosen as any member of Class A, for example, we would like that
$S(P,\{A_m\}) < S(P,\{B_m\})$ on the average, where $\{B_m\}$ is the set of known
members of any other Class B. Similarly, if P is any member of B, then
$S(P,\{B_m\}) < S(P,\{A_m\})$. The two desirable requirements are conveniently
combined in the statement of the following problem.

Find the metric or transformation of a given class of transformations
which maximizes $S(P,\{B_m\}) - S(P,\{A_m\})$ on the average, if P belongs to
Category A, while requiring that the average of $S(P,\{A_m\}) - S(P,\{B_m\})$ for
any P contained in Category B is a positive constant. The constraint of this
problem means that not only points of Category A but also those of B are
classified correctly, on the average.*

It is important to note that the above problem is not aimed at
maximizing the number of correct decisions. Instead it makes the correct decisions
most unequivocal, on the average. It is substantially more difficult to
maximize the number of correct classifications. For that purpose a binary
function would have to be defined which assumes the more positive of its two
values whenever a decision is correct and, conversely, assumes the lower
value for incorrect classifications. The sum of this binary function evaluated
for each labeled point would have to be maximized. This problem does not lend
itself to ready analytical solution; it may be handled, however, by computer
methods.
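The difference between the two objectives can be illustrated with a minimal sketch. It assumes one-dimensional events that have already been transformed, and it includes each point in the average over its own class; the function names and the sample data are hypothetical and serve only as illustration.

```python
def mean_square_distance(p, members):
    """Mean-square distance S(P, {X_m}) between point p and every member of a set."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, x)) for x in members) / len(members)

def margin(p, own, other):
    """S(P, other) - S(P, own): positive when P is closer to its own class."""
    return mean_square_distance(p, other) - mean_square_distance(p, own)

def average_margin(own, other):
    """Average margin over the labeled members of `own` (the quantity Problem 2 maximizes)."""
    return sum(margin(p, own, other) for p in own) / len(own)

def count_correct(own, other):
    """Sum of the binary function: 1 for a correct decision, 0 for an incorrect one."""
    return sum(1 if margin(p, own, other) > 0 else 0 for p in own)

# Hypothetical labeled events: one member of Class A lies among the members of Class B.
class_a = [(0.0,), (1.0,), (5.0,)]
class_b = [(4.0,), (6.0,)]

print(average_margin(class_a, class_b))  # average margin of the correct decisions
print(count_correct(class_a, class_b))   # number of correct decisions
```

In this example the average margin is positive although one of the three members of Class A is misclassified, which is why the two criteria lead to different optimization problems.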


5.2 A Separating Transformation

The particular linear transformation which maximizes the mean-square
interset distance while holding the sum of the mean-square inter and intraset
distances constant is developed below. Recall that the purpose of this
transformation is to separate events of dissimilar categories while clustering
those which belong to the same class.

The mean-square distance between the $M_1$ members of the set $\{F_m\}$ and
the $M_2$ members of the set $\{G_p\}$, after their linear transformation, is given
in Equation 5.1, where $f_{ms}$ and $g_{ps}$ are the s-th coefficients of the m-th and
p-th members of the sets $\{F_m\}$ and $\{G_p\}$, respectively. For the sake of
notational simplicity this mean-square interset distance is denoted by
$S(\{F_m\},\{G_p\})$ and is the quantity to be maximized by a suitable choice of the
linear transformation. The choice of the notation above is intended to signify
that the transformation to be found is a function of the two sets.

\[ S(\{F_m\},\{G_p\}) = \frac{1}{M_1 M_2} \sum_{m=1}^{M_1} \sum_{p=1}^{M_2} \sum_{n=1}^{N} \Big[ \sum_{s=1}^{N} w_{ns} (f_{ms} - g_{ps}) \Big]^2 \tag{5.1} \]

* The symmetrical situation, where $S(P,\{A_m\}) - S(P,\{B_m\})$ for $P \in B$ is also
maximized, leads to the same solution.


The constraint that the mean-square distance K between points,
regardless of the set to which they belong, is a constant is expressed by
Equation 5.2, where $v_{ms}$ is the s-th coefficient of any point belonging to the
union of the sets $\{F\}$ and $\{G\}$, and $M = M_1 + M_2$.

\[ K = \frac{1}{M^2} \sum_{m=1}^{M} \sum_{p=1}^{M} \sum_{n=1}^{N} \Big[ \sum_{s=1}^{N} w_{ns} (v_{ms} - v_{ps}) \Big]^2 \tag{5.2} \]


Both of the above equations may be simplified by expanding the
squares as double sums and interchanging the order of summations. Carrying
out the indicated operations, we obtain Equations 5.3 and 5.4.

\[ S(\{F_m\},\{G_p\}) = \sum_{n=1}^{N} \sum_{s=1}^{N} \sum_{r=1}^{N} w_{ns} w_{nr} x_{sr} \tag{5.3a} \]

where

\[ x_{sr} = x_{rs} = \frac{1}{M_1 M_2} \sum_{m=1}^{M_1} \sum_{p=1}^{M_2} (f_{ms} - g_{ps})(f_{mr} - g_{pr}) \tag{5.3b} \]

and

\[ K = \sum_{n=1}^{N} \sum_{s=1}^{N} \sum_{r=1}^{N} w_{ns} w_{nr} t_{sr} \tag{5.4a} \]

where

\[ t_{sr} = t_{rs} = \frac{1}{M^2} \sum_{m=1}^{M} \sum_{p=1}^{M} (v_{ms} - v_{ps})(v_{mr} - v_{pr}) \tag{5.4b} \]


The coefficient $x_{sr}$ is the general element of the matrix [X], which is of the
form of a covariance matrix and arises from considerations of cross-set
distances. The matrix [T] with general coefficient $t_{sr}$, on the other hand,
arises from considerations involving distances between the total number of
points of all sets.
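As a rough illustration of how the two matrices might be formed from labeled data, the following sketch computes [X] and [T] for two small sets. The function names and the sample points are hypothetical, and the normalization of [T] by the square of the total number of points is an assumption about the averaging convention.

```python
def cross_set_matrix(f_set, g_set):
    """x_sr: average of (f_ms - g_ps)(f_mr - g_pr) over all pairs (m, p)."""
    n = len(f_set[0])
    m1, m2 = len(f_set), len(g_set)
    x = [[0.0] * n for _ in range(n)]
    for f in f_set:
        for g in g_set:
            d = [fs - gs for fs, gs in zip(f, g)]
            for s in range(n):
                for r in range(n):
                    x[s][r] += d[s] * d[r] / (m1 * m2)
    return x

def all_points_matrix(f_set, g_set):
    """t_sr: the same average taken over all ordered pairs of the union of both sets."""
    union = f_set + g_set
    return cross_set_matrix(union, union)

# Hypothetical two-dimensional events.
set_f = [(0.0, 0.0), (2.0, 0.0)]
set_g = [(4.0, 1.0), (6.0, 1.0)]

X = cross_set_matrix(set_f, set_g)
T = all_points_matrix(set_f, set_g)
print(X)
print(T)
```

Both matrices come out symmetric, as the definitions require, because each is an average of outer products of difference vectors.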

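We now maximize Equation 5.3a, subject to the constraint of
Equation 5.4a, by the method of Lagrange multipliers. Since $dw_{ns}$ is arbitrary
in Equation 5.5, Equation 5.6 must be satisfied.

\[ dS - \lambda\, dK = 2 \sum_{n=1}^{N} \sum_{s=1}^{N} dw_{ns} \Big[ \sum_{r=1}^{N} w_{nr} (x_{sr} - \lambda t_{sr}) \Big] = 0 \tag{5.5} \]

\[ \sum_{r=1}^{N} w_{nr} (x_{sr} - \lambda t_{sr}) = 0, \qquad n = 1, 2, \ldots, N; \quad s = 1, 2, \ldots, N. \tag{5.6} \]

Equation 5.6 may be written in matrix notation to exhibit the
solution in an illuminating way. If we let $w_n$ be a vector with N components
$w_{n1}, w_{n2}, \ldots, w_{nN}$, then Equation 5.6 may be written as in Equation 5.7a.

\[ \begin{aligned} w_1 [X - \lambda T] &= 0 \\ w_2 [X - \lambda T] &= 0 \\ &\ \,\vdots \\ w_N [X - \lambda T] &= 0 \end{aligned} \tag{5.7a} \]

By postmultiplying both sides of the equation by $T^{-1}$ we obtain Equation 5.7b,
which is in the form of an eigenvalue problem.

\[ \begin{aligned} w_1 [X T^{-1} - \lambda I] &= 0 \\ &\ \,\vdots \\ w_N [X T^{-1} - \lambda I] &= 0 \end{aligned} \tag{5.7b} \]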


Note that $T^{-1}$ always exists since T is positive definite. Equations 5.7a and b
may be satisfied in either of two ways. Either $w_n$, the n-th row of the
linear transformation described by the matrix [w], is identically zero, or it
is an eigenvector of the matrix $[XT^{-1}]$. We must substitute back into the
mean-square interset distance given by Equation 5.3a to find the solution which
maximizes S. To facilitate this substitution, we recognize that through
matrix notation Equations 5.3a and 5.4a may be written as Equations 5.8 and
5.9, respectively.


\[ S(\{F_m\},\{G_p\}) = \sum_{n=1}^{N} w_n X w_n^T \tag{5.8} \]

\[ K = \sum_{n=1}^{N} w_n T w_n^T \tag{5.9} \]

But from Equation 5.7a we see that $w_n X$ may always be replaced by $\lambda w_n T$.
Carrying out this substitution in 5.8, we obtain Equation 5.10, where the
constraint of 5.9 is also utilized.

\[ \max S(\{F_m\},\{G_p\}) = \sum_{n=1}^{N} \lambda\, w_n T w_n^T = \lambda K \tag{5.10} \]

It is now apparent that the largest eigenvalue of $|X - \lambda T| = 0$ yields the rows
of the transformation which maximizes the mean-square interset distance
subject to the constraint that the mean-square value of all distances is a
constant. The transformation is stated by Equation 5.11, where
$w_1 = (w_{11}, w_{12}, \ldots, w_{1N})$ is the eigenvector corresponding to $\lambda_{\max}$.

\[ [w] = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1N} \\ w_{11} & w_{12} & \cdots & w_{1N} \\ \vdots & \vdots & & \vdots \\ w_{11} & w_{12} & \cdots & w_{1N} \end{bmatrix} \tag{5.11} \]
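The eigenvalue solution can be checked numerically on a small case. The sketch below is restricted to two dimensions so that the characteristic equation is a quadratic; the matrices are illustrative numbers rather than data from the report, and the helper names are hypothetical.

```python
import math

X = [[18.0, 4.0], [4.0, 1.0]]   # illustrative cross-set matrix [X]
T = [[10.0, 2.0], [2.0, 0.5]]   # illustrative positive definite matrix [T]

def largest_root(x, t):
    """Largest lambda satisfying det(X - lambda*T) = 0 for symmetric 2x2 X and T."""
    a = t[0][0] * t[1][1] - t[0][1] ** 2          # det T, positive by assumption
    b = -(x[0][0] * t[1][1] + x[1][1] * t[0][0] - 2.0 * x[0][1] * t[0][1])
    c = x[0][0] * x[1][1] - x[0][1] ** 2          # det X
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def row_vector(x, t, lam):
    """A non-zero row w with w[X - lam*T] = 0, taken from the singular 2x2 matrix."""
    a = [[x[i][j] - lam * t[i][j] for j in range(2)] for i in range(2)]
    w = (a[1][0], -a[0][0])
    if abs(w[0]) + abs(w[1]) < 1e-12:
        w = (a[1][1], -a[0][1])
    return w

def quad(w, m):
    """Quadratic form w M w^T."""
    return sum(w[s] * m[s][r] * w[r] for s in range(2) for r in range(2))

lam = largest_root(X, T)
w1 = row_vector(X, T, lam)
rows = [w1, w1]                        # every row equal, as in Equation 5.11
S = sum(quad(w, X) for w in rows)      # Equation 5.8
K = sum(quad(w, T) for w in rows)      # Equation 5.9
print(lam, w1, S, K)
```

For these numbers the interset distance S equals the largest root times K, in agreement with Equation 5.10.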

The transformation of the equation above is singular, expressing the
fact that the projection of the points along the line of maximum mean-square
interset distance and minimum intraset distance is the only important feature
of events that determines their class membership. This is illustrated in
Figure 11, where line aa' is in the direction of the first eigenvector of the
matrix $[XT^{-1}]$. A point P of unknown classification is grouped in Category B
because the mean-square difference between its projection on line aa' and the
projection of points belonging to set B, $S(P,\{B_m\})$, is less than $S(P,\{A_m\})$,
the corresponding difference with members of set A.


Figure 11. A Singular Class-Separating Transformation
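The decision rule described for the singular transformation can be sketched as follows. The direction `w`, the sample sets, and the function names are hypothetical; the only assumption carried over from the text is that similarity is the mean-square difference between projections on a single line.

```python
def project(point, w):
    """Projection of a point onto the direction w (a single row of the transformation)."""
    return sum(c * wc for c, wc in zip(point, w))

def similarity(p, members, w):
    """Mean-square difference between the projection of p and those of the set members."""
    yp = project(p, w)
    return sum((yp - project(m, w)) ** 2 for m in members) / len(members)

def classify(p, set_a, set_b, w):
    """Assign p to the class whose members are closer in the projected sense."""
    return "A" if similarity(p, set_a, w) < similarity(p, set_b, w) else "B"

set_a = [(0.0, 0.0), (2.0, 0.0)]
set_b = [(4.0, 1.0), (6.0, 1.0)]
w = (0.0, 1.0)  # assumed direction of line aa'

print(classify((1.0, 0.9), set_a, set_b, w))
print(classify((5.0, 0.1), set_a, set_b, w))
```

The first point is assigned to B and the second to A, even though their first coordinates suggest the opposite, because only the projection on the chosen line enters the decision.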

Forcing the separating transformation to be non-singular is possible
by the imposition of a different constraint on the maximization.
Unfortunately, imposing non-singularity directly is a formidable mathematical
task. In general it requires evaluating a determinant, such as the
Gramian, and assuring that it does not vanish. In the following discussion,
a seemingly meaningless constraint will at first be imposed on the maximization
of the mean-square interset distance. After the solution is obtained, it
will be shown that the meaningless constraint can be converted to a constraint
which holds the mean-square of all distances constant, the same constraint
we used previously.

The mean-square interset distance to be maximized is given by
Equation 5.3a, which is reproduced here as Equation 5.12.

\[ S(\{F_m\},\{G_p\}) = \sum_{n=1}^{N} \sum_{s=1}^{N} \sum_{r=1}^{N} w_{ns} w_{nr} x_{sr} \tag{5.12} \]


The constraint we will impose is that the mean-square length of the projections
of all distances between any pair of points onto the directions $w_n$ be fixed
at constants which are, in general, different. This constraint is expressed by
Equation 5.13, which differs from the previously used constraint of Equation 5.4
only by fixing coordinate by coordinate the mean-square value of all possible
distances between points.

\[ \sum_{s=1}^{N} \sum_{r=1}^{N} w_{ns} w_{nr} t_{sr} = K_n, \qquad n = 1, 2, \ldots, N. \tag{5.13} \]


Assigning an arbitrary constant $\lambda_n$ to the differential of each of the above
N constraints and using the method of Lagrange multipliers in the maximization
of S above, Equation 5.14 is obtained.

\[ dS - \sum_{n=1}^{N} \lambda_n\, dK_n = 2 \sum_{n=1}^{N} \sum_{s=1}^{N} dw_{ns} \Big[ \sum_{r=1}^{N} w_{nr} (x_{sr} - \lambda_n t_{sr}) \Big] = 0 \tag{5.14} \]

When we make use of the convenient matrix notation employed earlier, we
obtain Equation 5.15, which differs significantly from Equation 5.7a, despite
the similar appearance of the two equations.

\[ \begin{aligned} w_1 [X - \lambda_1 T] &= 0 \\ w_2 [X - \lambda_2 T] &= 0 \\ &\ \,\vdots \\ w_N [X - \lambda_N T] &= 0 \end{aligned} \tag{5.15} \]


The solution of Equation 5.15 states that each row of the linear transformation,
$w_n$, is a different eigenvector of the $[XT^{-1}]$ matrix. The transformation [w]
is therefore orthogonal. Equation 5.16 is a further constraint which converts
that of 5.13 to holding the mean-square of all distances constant, and thus
accomplishes the aim of this section.

\[ K = \sum_{n=1}^{N} K_n \tag{5.16} \]

Note that before we knew that the rows of the transformation [w]
would be orthogonal, the condition expressed by Equation 5.16 did not fix the
total distances. The above procedure resulted in finding the non-singular
orthogonal transformation which optimally separates the classes and optimally
clusters members of the same class.

We will now compute the mean-square interset distance S of
Equation 5.12. To facilitate the computation, S will be written in matrix
notation as in Equation 5.17.

\[ S(\{F_m\},\{G_p\}) = \sum_{n=1}^{N} w_n X w_n^T \tag{5.17} \]

From Equation 5.15 it is seen, however, that if S is maximum, $w_n X$ may be
replaced with $\lambda_n w_n T$ to obtain Equation 5.18, where $w_n T w_n^T = K_n$
from Equation 5.13 (in matrix notation). Equation 5.19 is thus obtained.

\[ \max S(\{F_m\},\{G_p\}) = \sum_{n=1}^{N} \lambda_n\, w_n T w_n^T \tag{5.18} \]

\[ \max S(\{F_m\},\{G_p\}) = \sum_{n=1}^{N} \lambda_n K_n \tag{5.19} \]

It is now readily seen, with reference to Equation 5.10, that the upper bound
on the mean-square interset distance is achieved by the singular transformation
discussed earlier, and we pay for forcing the transformation to be non-singular
by achieving only a reduced separability of classes.
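The loss in separability can be seen in a small numerical sketch. It assumes two dimensions, illustrative matrices [X] and [T] (not data from the report), and hypothetical helper names; each row of the transformation is taken as the solution of Equation 5.15 for a different root.

```python
import math

X = [[18.0, 4.0], [4.0, 1.0]]   # illustrative cross-set matrix [X]
T = [[10.0, 2.0], [2.0, 0.5]]   # illustrative positive definite matrix [T]

def roots(x, t):
    """Both roots of det(X - lambda*T) = 0 for symmetric 2x2 matrices, smallest first."""
    a = t[0][0] * t[1][1] - t[0][1] ** 2          # det T, positive by assumption
    b = -(x[0][0] * t[1][1] + x[1][1] * t[0][0] - 2.0 * x[0][1] * t[0][1])
    c = x[0][0] * x[1][1] - x[0][1] ** 2
    disc = math.sqrt(b * b - 4.0 * a * c)
    return (-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)

def row_for(x, t, lam):
    """A non-zero row w with w[X - lam*T] = 0."""
    a = [[x[i][j] - lam * t[i][j] for j in range(2)] for i in range(2)]
    w = (a[1][0], -a[0][0])
    if abs(w[0]) + abs(w[1]) < 1e-12:
        w = (a[1][1], -a[0][1])
    return w

def quad(w, m):
    """Quadratic form w M w^T."""
    return sum(w[s] * m[s][r] * w[r] for s in range(2) for r in range(2))

lams = roots(X, T)
rows = [row_for(X, T, lam) for lam in lams]   # one row per eigenvalue
k_n = [quad(w, T) for w in rows]              # the constants K_n of Equation 5.13
s_total = sum(quad(w, X) for w in rows)       # Equation 5.17
s_bound = max(lams) * sum(k_n)                # largest eigenvalue times K
print(lams, k_n, s_total, s_bound)
```

Here the interset distance equals the sum of the products of each eigenvalue with its K_n, as in Equation 5.19, and it stays below the value obtained when the whole of K is assigned to the largest eigenvalue, which is the singular case of Equation 5.10.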

Before leaving the discussion of class-separating transformations,
a few important facts must be pointed out. A simple formal replacement of
the matrices X and T by other suitably chosen matrices yields the solution of
many interesting and useful problems. It is not the purpose of the following
remarks to catalog the problems solved by the formal solution previously
obtained; yet some deserve mention because of their importance. It may be
readily verified, for instance, that replacing T by I is equivalent to
maximizing the between-set distances, subject to the condition that the volume
of the space is a constant. The transformation which accomplishes this is
orthogonal, with rows equal to the eigenvectors of the matrix X. This
is a particularly obvious result, of course, since the eigenvectors of X are the
set of orthogonal directions along which interset distances are maximized,
on the average. A figure which would illustrate this result is similar to
Figure IX.

Another replacement which must be singled out is the substitution
of the matrix U for T, where U is the covariance matrix associated with all
intraset distances (distances among like events). Eigenvectors of $[XU^{-1}]$
form rows of the transformation which maximizes interset distances while
holding intraset distances constant. This problem is essentially the same
as the maximization of interset distances while holding the sum of inter and
intraset distances constant, yet the relative separation of sets achieved by
the two transformations is different. The difference may be exhibited by
computing the ratio of the mean-square separation of sets to the mean
clustering of elements within the same set, as measured by the mean-square
intraset distance. It may be concluded, therefore, that the constraint employed
in the maximization of interset distances does have an influence on the degree
of separation achieved between sets.

Throughout this chapter the class-separating transformations were
developed by reference to the existence of only two sets, $\{F_m\}$ and $\{G_p\}$. The
results obtained by these methods are more general, however, because they
apply directly to the separation of an arbitrary number of sets. For instance,
in the maximization of the mean-square interset distance, there is no reason
why the matrix X should involve interset distances between only two sets. An
arbitrary number of sets may be involved, and the interset distances are simply
all those distances measured between two points not in the same set. Similar
arguments are valid for all the other matrices involved. The only precaution
that must be taken concerns the possible use of additional constraints
specifying preferential or nonpreferential treatment of classes. These additional
constraints may take the form of requiring, for instance, that the mean-square
intraset distances of all sets be equal or be related to each other by
some constants. Aside from these minor matters, the results apply to the
separation of any number of classes.
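The extension to several sets can be sketched by collecting every pair of points drawn from different sets. The function name and the data are hypothetical, and averaging over the total number of such pairs is an assumption about the normalization.

```python
from itertools import combinations

def interset_matrix(sets):
    """Average outer product of differences over all pairs of points from different sets."""
    n = len(sets[0][0])
    x = [[0.0] * n for _ in range(n)]
    pairs = 0
    for set_a, set_b in combinations(sets, 2):
        for a in set_a:
            for b in set_b:
                d = [ai - bi for ai, bi in zip(a, b)]
                for s in range(n):
                    for r in range(n):
                        x[s][r] += d[s] * d[r]
                pairs += 1
    return [[value / pairs for value in row] for row in x]

two_sets = [[(0.0, 0.0), (2.0, 0.0)], [(4.0, 1.0), (6.0, 1.0)]]
three_sets = [[(0.0,)], [(1.0,)], [(3.0,)]]

print(interset_matrix(two_sets))
print(interset_matrix(three_sets))
```

With two sets the function reduces to the cross-set matrix of the two-class case; with more sets the same eigenvalue machinery applies to the enlarged matrix.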

5.3 Maximization of Correct Classifications

The correct classifications of points of the set F are made more
unequivocal by the linear transformation which makes any event of set F
more similar to members of F, on the average, than to those of another set G.
One of the ways in which the average unequivocalness of correct classificatory
decisions may be stated mathematically is to require that a numerical value
associated with the quality of a decision be maximized, on the average. Of
the several quantitative measures of the quality of a decision which may be
defined, one that readily lends itself to mathematical treatment is given in
Equation 5.20. The difference in the similarity between a point P and each of
the two sets, F and G, is a quantity Q which is larger if the decision regarding
the classification of P is more unequivocal.

\[ S(P,\{G_p\}) - S(P,\{F_m\}) = Q \tag{5.20} \]

Since decisions in previous chapters were based on the comparison of Q with
a suitable threshold value (such as zero), we now wish to find that linear
transformation which maximizes Q, on the average, whenever Q is to be positive.
If P is a member of the set F, then P is closer to F than to G and thus Q
is to be positive. The maximization of Q for $P \in F$ results in maximizing the
margin with which correct decisions are made, on the average. The foregoing
maximization is stated in Equation 5.21, subject to the constraint expressed
by Equation 5.22. The latter simply states that if $P \in G$, the average decision is
still correct, as measured by the margin K.

\[ \frac{1}{M_1} \sum_{n=1}^{M_1} \big[ S(F_n,\{G_p\}) - S(F_n,\{F_m\}) \big] = Q = \text{maximum, subject to} \tag{5.21} \]

\[ \frac{1}{M_2} \sum_{n=1}^{M_2} \big[ S(G_n,\{F_m\}) - S(G_n,\{G_p\}) \big] = K = \text{constant} > 0 \tag{5.22} \]
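The two averages of Equations 5.21 and 5.22 can be evaluated directly for a given transformation. The sketch below uses the identity transformation and hypothetical data; it assumes that the intraset average for a point excludes the point itself, which is an interpretation rather than something stated explicitly in the text.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_margin(own, other):
    """Average over P in `own` of S(P, other) - S(P, own); requires at least two members."""
    total = 0.0
    for i, p in enumerate(own):
        to_other = sum(sq_dist(p, g) for g in other) / len(other)
        # Intraset term: average over the remaining members of the same set.
        to_own = sum(sq_dist(p, f) for j, f in enumerate(own) if j != i) / (len(own) - 1)
        total += to_other - to_own
    return total / len(own)

set_f = [(0.0, 0.0), (2.0, 0.0)]
set_g = [(4.0, 1.0), (6.0, 1.0)]

q = average_margin(set_f, set_g)  # left side of Equation 5.21
k = average_margin(set_g, set_f)  # left side of Equation 5.22
print(q, k)
```

Both averages are positive for this data, so members of either set are classified correctly on the average; the optimization of this section chooses the transformation that makes the first average as large as possible while the second is held fixed.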

Utilizing previously obtained results, the above equations are
readily solved for the optimum linear transformation. Rewriting the first term
of Equation 5.21, we note that it expresses the mean-square interset distance
between sets F and G and may be written as in Equation 5.23, where Equation
5.1 and the simplifying notation of 5.3 are employed.

\[ \frac{1}{M_1} \sum_{n=1}^{M_1} S(F_n,\{G_p\}) = S(\{F_m\},\{G_p\}) \tag{5.23a} \]

\[ S(\{F_m\},\{G_p\}) = \sum_{n=1}^{N} \sum_{s=1}^{N} \sum_{r=1}^{N} w_{ns} w_{nr} x_{sr} \tag{5.23b} \]


The second term of Equation 5.21 is the mean-square intraset distance of set F
and may be expressed as in Equation 5.24. The argument of the covariance
coefficient $u_{sr}(F)$ signifies that it is a covariance of elements of the set F.

\[ \frac{1}{M_1} \sum_{n=1}^{M_1} S(F_n,\{F_m\}) = S(\{F_m\},\{F_m\}) \tag{5.24a} \]

\[ S(\{F_m\},\{F_m\}) = \frac{2M_1}{M_1 - 1} \sum_{n=1}^{N} \sum_{s=1}^{N} \sum_{r=1}^{N} w_{ns} w_{nr} u_{sr}(F) \tag{5.24b} \]
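The origin of the factor $2M_1/(M_1 - 1)$ can be checked in one dimension. The short derivation below assumes that the intraset average runs over the $M_1(M_1 - 1)$ ordered pairs of distinct members and that $u(F)$ denotes the variance normalized by $M_1$; both are assumptions made for this illustration.

```latex
\begin{aligned}
\frac{1}{M_1 (M_1 - 1)} \sum_{m=1}^{M_1} \sum_{p \neq m} (f_m - f_p)^2
  &= \frac{1}{M_1 (M_1 - 1)} \Big[ 2 M_1 \sum_{m} f_m^2 - 2 \Big( \sum_{m} f_m \Big)^2 \Big] \\
  &= \frac{2 M_1}{M_1 - 1} \Big[ \frac{1}{M_1} \sum_{m} f_m^2 - \Big( \frac{1}{M_1} \sum_{m} f_m \Big)^2 \Big] \\
  &= \frac{2 M_1}{M_1 - 1}\, u(F).
\end{aligned}
```

For two members at 0 and 2, for example, the left side is 4, the variance is 1, and the factor is 4, so the two sides agree.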


Similarly, the first term of Equation 5.22 is the mean-square interset distance,
and the second term is the intraset distance of set G. The maximization
problem can thus be restated by Equations 5.25a and b.

\[ \text{Maximize } Q = \sum_{n=1}^{N} \sum_{s=1}^{N} \sum_{r=1}^{N} w_{ns} w_{nr} \Big[ x_{sr} - \frac{2M_1}{M_1 - 1} u_{sr}(F) \Big] \tag{5.25a} \]

\[ \text{Subject to } K = \sum_{n=1}^{N} \sum_{s=1}^{N} \sum_{r=1}^{N} w_{ns} w_{nr} \Big[ x_{sr} - \frac{2M_2}{M_2 - 1} u_{sr}(G) \Big] \tag{5.25b} \]


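Following the methods used earlier, the solution of the above problem may be
written down by inspection.

\[ dQ - \lambda\, dK = 2 \sum_{n=1}^{N} \sum_{s=1}^{N} dw_{ns} \Big[ \sum_{r=1}^{N} w_{nr} (\alpha_{sr} - \lambda \beta_{sr}) \Big] = 0 \tag{5.26} \]

From 5.26 it follows that Equation 5.27a must hold, where $\alpha_{sr}$ and $\beta_{sr}$ are
given by Equations 5.27b and c.

\[ \sum_{r=1}^{N} w_{nr} (\alpha_{sr} - \lambda \beta_{sr}) = 0, \qquad n = 1, 2, \ldots, N \text{ and } s = 1, 2, \ldots, N \tag{5.27a} \]

\[ \alpha_{sr} = x_{sr} - \frac{2M_1}{M_1 - 1} u_{sr}(F) \tag{5.27b} \]

\[ \beta_{sr} = x_{sr} - \frac{2M_2}{M_2 - 1} u_{sr}(G) \tag{5.27c} \]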

By reference to earlier results, such as those expressed by Equation 5.6,
the transformation whose coefficients satisfy an equation of the form
above is the solution of the eigenvalue problem of Equation 5.28, where $w_n$
is a row of the matrix expressing the linear transformation.

\[ \begin{aligned} w_1 [\alpha - \lambda \beta] &= 0 \\ w_2 [\alpha - \lambda \beta] &= 0 \\ &\ \,\vdots \\ w_N [\alpha - \lambda \beta] &= 0 \end{aligned} \tag{5.28} \]


Analogous to the arguments used in the previous section, the above solution
yields a singular transformation. Forcing the transformation to be
non-singular, in the manner already outlined, results in the optimum transformation
as an orthogonal transformation, where each row of the matrix [w] is an
eigenvector of $[\alpha\beta^{-1}]$. Furthermore, it is readily shown that the solution so
obtained indeed maximizes Q.

It is interesting to note that the maximization of the average
correct classifications can be considered as the maximization of the difference
between inter and intraset distances. This alternate statement of the problem
may be exhibited by the addition of Equation 5.25b to 5.25a.

\[ Q + K = \sum_{n=1}^{N} \sum_{s=1}^{N} \sum_{r=1}^{N} w_{ns} w_{nr} \Big[ 2x_{sr} - \Big\{ \frac{2M_1}{M_1 - 1} u_{sr}(F) + \frac{2M_2}{M_2 - 1} u_{sr}(G) \Big\} \Big] \tag{5.29} \]

But the expression within the braces is simply the covariance associated
with all intraset distances. Since K is a constant, the maximization of
Equation 5.29 is equivalent to the maximization of Q.

In summing up the results of this chapter, it is seen that the problem
of learning to measure similarity to events of a common category, while profiting
from knowledge of nonmembers of the same category, may be treated as a
maximization or minimization problem. A metric or a linear transformation is found
from a class of metrics or transformations which solves mathematical problems
which express the desire not only to cluster events known to belong to
the same category, but also to separate those which belong to different
categories. Within the restricted class of metrics or transformations considered
in this chapter, the solutions are in the form of eigenvalue problems
which emphasize features that examples of a category have in common, and which
at the same time differ from features of other categories.

7. CONCLUSIONS AND RECOMMENDATIONS

The application of pattern recognition to missile detection and
decoy discrimination was investigated. As a result of these investigations,
a mathematical theory was developed to achieve the optimum combination of
measurable or computable target properties to achieve the separation of
target types, based on their observable properties. The input to the theory
is the set of discrimination techniques and useful target-signature
parameters. The theory operates on these inputs to determine their optimum
method of combination and thus determine what the essential properties
of targets are that allow their recognition. An important attribute of
the theory is that it does not become obsolete as new techniques are
developed and break-throughs are achieved in methods of distinguishing one type
of target or threat from another. The achievement of the present theory is
that it states how techniques developed by others are to be combined.


The recommendation for further work, in part, is to develop the
theory, which has thus far shown promising results, and to explore the
already known avenues of research opened up by current investigations under
the present program. Simultaneously with further theoretical developments,
the already successfully tested methods should be applied to actual missile
data. Detailed recommendations have been outlined in a proposal.

In the course of work performed during the present contractual
effort, and as a result of the investigations carried out during this effort,
certain facts relating to Ballistic Missile Defense, and in particular, to
the problems of Missile Detection and Decoy Discrimination, became apparent.
These facts are listed below and are substantiated in the body of this report.


UNCLASSIFIED 




a) It may be inferred from the results of the present study that
present computers are entirely adequate in size and in speed of operation to
handle the job of Missile Detection and Decoy Discrimination in Ballistic
Missile Defense. Neither the complexity nor the number of calculations is
outside the capabilities of presently available computers.

b) The instrumentation of the theoretical techniques developed
during the present effort would serve a two-fold purpose. On the one hand,
it is the working system with which missile detection and decoy discrimination
could be carried out in the field; on the other hand, it would serve
as a research tool with which both the usefulness of new sensors and
recognition techniques can be evaluated, and the direction of desirable further
data collection efforts pinpointed.

c) It is felt that the system suggested in this report is both
the final system which performs missile detection and decoy discrimination
in the field, as well as the tool with which to discover how the job is to
be done and what the missile signatures are which permit discrimination
between missiles and other targets. For this reason, it appears that,
stemming from this dual use, the proposed system will accomplish a
significant saving in the developmental time of a ballistic-missile defense
system.

d) It is generally realized that no single measurable or
computable parameter will be adequate to perform missile detection and decoy
discrimination with sufficient reliability. It is believed that a
combination of techniques will be necessary to perform these tasks. While such
notions are generally entertained, the precise method of combination of
the various and basically different techniques remains unsolved. In this
report we can specify exactly how the different present and future techniques
which are useful in one situation or another can be combined to yield optimal
discriminatory decisions. Furthermore, the manner of combination of
techniques will not require alteration either conceptually or in the hardware
instrumentation, if new and improved techniques become available.

e) An important conclusion of the present study effort is that the
best utilization of a combination of techniques in a single decision process
is achieved only through the simultaneous collection of data. It is
absolutely necessary that simultaneously collected measurements by all sensors
and data processing systems be available if the optimum combination of
techniques is to be found.

f) Since a technique of data processing which utilizes the
combination of detection and discrimination techniques transcending several different
kinds of primary sensors will be used, such a system must face the problem of
lack of uniformity of data storage. It is one of the conclusions of the present
study that uniformity of data storage should be a desirable objective from the
very beginning, lest the problem take on insurmountable proportions later on.
The requirement should be established to maintain adherence to a mutually
satisfactory data storage format.

APPENDIX A

The Solution of Eigenvalue Problems

The frequency with which eigenvalue problems occur in classificatory
analysis and the difficulty with which they are solved warrant the careful
examination of the available methods of solution. The slowest link in the
computations involved in the process of forming category membership measuring
functions is the solution of eigenvalue problems. It takes a considerable
length of time for even a fast computer to solve for the eigenvalues and vectors
of a large matrix. This limits the speed with which the influence of a new
event is felt on the recognition process. In order that machine learning be
carried out in essentially "real time", it is necessary to search for a physical
phenomenon or a natural process which is the solution of an eigenvalue problem.
The natural phenomenon must have enough controllable parameters to allow the
setting up of an arbitrary positive definite symmetric matrix. The objective
of this Appendix is to focus attention on the importance of finding such a
natural phenomenon and to give an example which, although not completely general,
as we shall see, nor as practical as some would like, does demonstrate the
feasibility of solving eigenvalue problems very rapidly.

Consider the two-loop lossless network of Figure A-1 which is
excited with a voltage source at its input. Letting the complex frequency
be $\lambda$ and the reciprocal capacitance (elastance) values be called S, the
loop equations of the network may be written as in Equation A.1.

Multiplying both sides of the equation by $\lambda$ and writing it in matrix notation,
we obtain Equation A.2, where e and i are vectors of the voltage excitations
in the loops and loop currents, respectively.

\[ \lambda e = [\lambda^2 L + S]\, i \tag{A.2} \]

If the input is short-circuited and the vector e is zero, any non-zero current
that flows in the network must do so at frequencies that satisfy Equation A.4,
where use is made of the knowledge that a lossless network must oscillate at
pure imaginary frequencies $\lambda = j\omega$. The resulting equation is an eigenvalue
problem of the same type encountered throughout this report.

\[ |\lambda^2 L + S| = 0 = |S - \omega^2 L| \tag{A.4} \]

The matrix [L] is a completely arbitrary, symmetric, positive definite matrix
whose coefficients each are controlled by (at most) two circuit elements.
The matrix [S] is also symmetric and positive definite, but its elements which
are off the principal diagonal must be negative or zero. This does not have
to be the case in the [L] matrix, for a negative mutual inductance is quite
realizable. Note, however, that if the mutual capacitance is short-circuited,
and the other capacitors are made equal (for convenience let them be unity),
then the natural frequencies of oscillation of the short-circuited network
satisfy the eigenvalue problem of Equation A.5.

\[ \Big| L - \frac{1}{\omega^2} I \Big| = 0 \tag{A.5} \]


The most general two-loop network corresponding to this equation is shown in
Figure A-2, where a transformer replaces the mutual inductances.

Figure A-2. Network Solution of an Eigenvalue Problem



The eigenvalues of [L] are thus determined by the natural frequencies
of oscillation. Components of the eigenvector corresponding to a given
eigenvalue are the amplitudes of the loop currents at the corresponding
frequency.
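The correspondence between natural frequencies and eigenvalues can be sketched numerically. The example assumes unit elastances, so that each eigenvalue of the inductance matrix equals the reciprocal of a squared angular frequency; the matrix values and function names are hypothetical.

```python
import math

def symmetric_2x2_eigenvalues(m):
    """Closed-form eigenvalues of a symmetric 2x2 matrix, smallest first."""
    mean = (m[0][0] + m[1][1]) / 2.0
    radius = math.hypot((m[0][0] - m[1][1]) / 2.0, m[0][1])
    return mean - radius, mean + radius

def natural_frequencies(inductance):
    """Angular frequencies w with |L - (1/w^2) I| = 0, i.e. w = 1/sqrt(eigenvalue)."""
    return [1.0 / math.sqrt(ev) for ev in symmetric_2x2_eigenvalues(inductance)]

def eigenvalues_from_frequencies(freqs):
    """What the network 'computes': eigenvalues recovered from observed frequencies."""
    return [1.0 / (w * w) for w in freqs]

L = [[2.5, 1.5], [1.5, 2.5]]   # hypothetical symmetric, positive definite inductance matrix

freqs = natural_frequencies(L)
print(freqs)
print(eigenvalues_from_frequencies(freqs))
```

The round trip recovers the eigenvalues of the matrix from the oscillation frequencies alone, which is the role the network plays in the scheme described above.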

Since lossless networks cannot be built in practice, let us investigate
the effect of losses in the network of Figure A-1. If a small resistance
is connected in series with every inductance such that the frequency of
oscillation in [...] the poles of the network, the error in using the damped
natural frequencies in place of the undamped frequencies may be calculated.
The percentage error of determining the eigenvalues is given in Equation A.6,
expressed in terms of the Q of the resonant circuits.

$\%\ \text{error in eigenvalues} = 100 \cdot \dfrac{1}{(2Q)^2 - 1}$    (A.6)

Even for a lossy network having a Q of 10, the error is only 0.25%. We
may thus draw the conclusion that network losses do not seriously affect the
accuracy of the eigenvalues.
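The quoted figure can be reproduced with a short calculation. The sketch below assumes that Equation A.6 reads 100 / ((2Q)² - 1), that the eigenvalue is taken as 1/ω², and that the damped frequency follows the usual series-RLC relation ω_d² = ω₀²(1 - 1/(4Q²)). The last two points are assumptions introduced for the cross-check, and the function names are hypothetical.

```python
import math

def percent_error(q):
    """Percentage error in an eigenvalue, read here as 100 / ((2Q)^2 - 1)."""
    if q <= 0.5:
        raise ValueError("Q must exceed 0.5 for an oscillatory response")
    return 100.0 / ((2.0 * q) ** 2 - 1.0)

def percent_error_from_frequencies(q, omega0=1.0):
    """Same quantity from the assumed damped frequency of a series RLC loop."""
    omega_d = omega0 * math.sqrt(1.0 - 1.0 / (4.0 * q * q))
    true_eigenvalue = 1.0 / omega0 ** 2        # eigenvalue taken as 1 / omega^2
    measured_eigenvalue = 1.0 / omega_d ** 2   # what the lossy network yields
    return 100.0 * (measured_eigenvalue - true_eigenvalue) / true_eigenvalue

for q in (5, 10, 50):
    print(q, round(percent_error(q), 4), round(percent_error_from_frequencies(q), 4))
```

For Q = 10 both routes give about 0.25%, in agreement with the figure quoted in the text.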

The eigenvalues may be obtained by spectrum analysis of any of the
loop currents. This is readily accomplished by feeding the voltage across any
of the series resistances into a tunable narrow-band filter whose tuning
frequencies corresponding to peak outputs yield the eigenvalues. The
corresponding eigenvector may be obtained by sampling the output amplitudes
of synchronously tuned narrow-band filters connected to measure each loop
current. The samples are taken when local peak outputs with tuning are observed.
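The peak-picking step can be sketched in software. The example below is only an analogy to the tunable filter: it sweeps a synthetic response made of two hypothetical resonances and reports the frequencies at which the output is a local maximum. The resonance model, the Q value, and the sweep grid are all assumptions made for illustration.

```python
import math

def resonance(w, w0, q):
    """Magnitude response of one hypothetical resonant loop (peaks near w0)."""
    x = w / w0
    return 1.0 / math.sqrt((1.0 - x * x) ** 2 + (x / q) ** 2)

def local_peaks(freqs, amps):
    """Frequencies at which the filter output is a strict local maximum."""
    return [
        freqs[i]
        for i in range(1, len(amps) - 1)
        if amps[i - 1] < amps[i] > amps[i + 1]
    ]

# Hypothetical natural frequencies and Q; the sweep stands in for retuning
# the narrow-band filter in small steps.
TRUE_FREQS = (0.7, 1.2)
Q = 20.0
sweep = [i * 0.01 for i in range(1, 301)]
output = [sum(resonance(w, w0, Q) for w0 in TRUE_FREQS) for w in sweep]

peaks = local_peaks(sweep, output)
print([round(p, 2) for p in peaks])
```

The resolution of the recovered frequencies is limited by the tuning step, much as the analog measurement is limited by the filter bandwidth.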
The size of the matrix solved by the preceding methods may be made
arbitrarily large. The reader can readily verify that if the matrix whose
eigenvalues and vectors we wish to compute is N x N, then the network topology
has to consist of N nodes which are connected to each other and to ground by
series LC networks, as illustrated by Figure A-3 for N = 5.

APPENDIX B. PATTERN RECOGNITION
In this Appendix we briefly outline a number of approaches to
the subject which have been taken by other people. There is a thread of
similarity among all of them. Yet they all differ, and some widely, for
they each ask their questions from separate frames of reference. And as
Susanne Langer and Sir Arthur Eddington have suggested, the way we ask
a question determines or frames our answer. The ichthyologist who sets
out to study the ocean life with a net having two-inch openings comes to
the "remarkable" conclusion that there are no fish smaller than two inches.

So too, each pattern recognition technique is limited. A model chosen to
simulate the human nervous system is limited by the "designer's"
understanding of the nervous system and also by the fact that the biological
system may not be the best one for the problem under examination.

Pattern recognition is the examination and classification of an
object as a member of one of a number of categories, rather than its unique
identification. The recognition process is information-destroying, and
only the invariant properties or characteristics of the pattern (or
patterns) of interest convey significant information for classification. Of
course, in order to perform a classification, it is necessary to know the
pattern-characteristic properties. Perhaps these properties can be chosen
through human ingenuity and study; and this, in fact, has been the basis
for most character recognition and speech recognition systems now in vogue.
But the more basic approach to the problem attempts to develop a technique
whereby supervised examination of known samples of the various categories
permits a computer to "learn" what the parameters are which are significant
to each category. The approach taken in this report is of the latter type.

* In the preceding sections of this report, methods are based on the
doctoral thesis work of George Sebestyen.

Pattern recognition is closely related to problem solving. Problem
solving consists of examining a given set of elements and finding a member
of a subset having specified properties. Use is made of processes which find
possible solutions, and of other processes which determine whether a
proposed solution is in fact a solution. Use is made of principles or devices
called heuristics, which contribute to the simplification of the search for
solutions.

The actual measure of the merit of any particular pattern recognition
technique is in the success of the decisions. All objects to be classified
can be represented in a space, perhaps an n-dimensional vector space.
If the objects can be associated with particular categories, then by
definition there is a decision rule (perhaps a rule defined on a point set)
for perfectly classifying every object, given its location in the vector
space (assuming no noise, of course). The "learning" task is to find the
decision rule from examination of a finite number of known examples. Clearly,
the goal is a technique which can find a good decision rule rapidly as the
number of samples is increased. All of the techniques make use of
heuristics. These are aids or constraints which limit the number of solutions
under consideration. These heuristics are chosen on the basis of the
designer's experience and/or imagination. They could be suggested by a
sophisticated computer. Often the heuristics chosen tend to degrade the
technique by hiding the best solutions. This simply suggests that a good
heuristic can be a big help whereas a poorly chosen one can eliminate the
proper solutions altogether. The differences among the various techniques
lie, therefore, in the manner in which each leads toward development (or
discovery) of a decision rule.

We have heard the term artificial intelligence. This is generally
applied to the study of machines which are told how to learn, but not how,
specifically, to perform a task. We might further note the subdivision in
pattern recognition into cases where the machine is told how to find what
makes the objects of a particular category equivalent, and cases where the
machine does not have this information. Thus we see that the class of
pattern recognition techniques includes machines which are told just how to
identify objects (from given truth tables), machines which are supplied with
helpful hints (heuristics) on how to learn to classify objects, and machines
which are told nothing and are allowed to organize themselves by
trial-and-error and through supervised learning. In the extreme, of course,
this latter type of machine has a greater hardship imposed upon it than does
the learning human mind.

Supervised learning consists in presenting known samples to the
machine and telling it to which category the object belongs. This is made
known to the machine through a process of reward-and-punishment. The
supervisor naturally imposes his own notions on just what constitutes the
pattern. From examination of samples used in character recognition studies
one immediately realizes that a particular letter is classed in a particular
category only because the supervisor (in this case, perhaps the one who drew
the character) chooses to classify this character in this manner. To any
other human being this character might be confused with another or it might
even be unintelligible.

Many researchers have approached the subject through study and subsequent
attempted simulation of the human nervous system. Still others have
felt that the best approach is to start fresh by attempting to devise a model
to solve the problem directly, without recourse to biological justifications.
It is interesting to note that the models developed from these markedly
different approaches have strong similarities with each other. This could
lead to the encouraging conclusion that progress has been good since
different approaches lead to similar solutions. More likely, however, the
conclusion might be that human researchers are again caught in that common
quagmire wherein they persist in asking their questions within the old
framework, and then are surprised when they get the same old answers. The
observer from the next dimension chuckles as he watches our dilemma, just as
did the three-dimensional visitor in the Flatland of Edwin Abbott. A
completely new approach to the problem could be most refreshing, but of
course its arrival is not now predictable.

The bibliography at the end of this report can serve as a survey of
pattern recognition. In reviewing the work done in pattern recognition let us
first examine the studies of Newell, Simon, and Shaw, at Rand, on the
processes of creative thinking and applications to a Logic Theorem Proving
machine and a chess playing machine. Strictly speaking, the work is not
called pattern recognition, but the ideas are interesting and are here
presented in more than just a passing manner. Creative activity is a special
class of problem-solving activity characterized by novelty,
unconventionality, persistence, and difficulty in problem formulation. The
Logic Theorist (a computer program, and possibly a machine) attempts to prove
theorems (handed to it) of the type found in Principia Mathematica, and in
proving the theorems it then conjectures and proves new theorems on which the
original proofs depend.

We earlier defined problem solving and discussed the processes of
generation of possible solutions and of determining whether a proposed
solution is in fact a solution. Apparently, for large difficult problems,
there may be large correlation between creativity and use of trial-and-error
generators.

The Logic Theorist operates on only a restricted set of proofs, and
tests these. The restrictions are on the number of logic expressions, number
of symbols in each expression, and number of different kinds of symbols used.
We may further restrict the algorithm by only considering sequences that are
valid proofs, i.e., whose initial expressions are axioms, and each of whose
expressions is derived from prior ones by valid rules of inference. Now, one
approach could be to generate first the shortest proofs, and then longer ones
by applying the rules of inference (in all possible ways) to the former
(shorter) proofs. This is working forward. Actually, the Logic Theorist
works backwards. The Logic Theorist generates proofs which contain the
desired expression (theorem) for the final one, and logical expressions
(obtained from logical inference) for the preceding ones. When a proof appears
whose initial expressions are theorems, we have found the desired proof.
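The idea of working backwards from the desired expression can be illustrated with a toy backward-chaining search. This is a minimal sketch and not the Logic Theorist itself: expressions are plain strings, the only rule of inference is "these premises yield this conclusion", and every name below is hypothetical.

```python
def prove_backward(goal, axioms, rules, depth=5):
    """Search backwards from `goal`; return a list of expressions ending in
    `goal` whose unsupported members are all axioms, or None on failure."""
    if goal in axioms:
        return [goal]
    if depth == 0:
        return None                      # give up on overly long chains
    for premises, conclusion in rules:
        if conclusion != goal:
            continue                     # rule cannot produce the goal
        proof = []
        for premise in premises:
            sub_proof = prove_backward(premise, axioms, rules, depth - 1)
            if sub_proof is None:
                break                    # this rule fails; try the next one
            proof.extend(s for s in sub_proof if s not in proof)
        else:
            return proof + [goal]        # every premise was traced to axioms
    return None

# Hypothetical axioms and inference rules, written as (premises, conclusion).
AXIOMS = {"A", "B"}
RULES = [(("A",), "C"), (("B", "C"), "D")]

proof = prove_backward("D", AXIOMS, RULES)
print(proof)
```

The depth limit plays the part of a restriction on proof length; without it, cyclic rules could make the search run indefinitely.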

We can specify a solution either by state description or by process
description. For example, in logic we can
write out an expression in the usual way, or we can give a sequence of
operations on the axioms (a proof) that will produce it.

We earlier defined the term heuristic to denote a principle or device
that reduces, on the average, the search required to reach a solution. Many
of the restrictions mentioned earlier are heuristics. Heuristics are processes
to select correctly a very small part of the total problem-solving maze for
exploration. Most heuristics depend on a strategy that modifies subsequent
search as a function of information obtained in previous search. Note that
algorithms are foolproof heuristics. Other heuristics are the following. In
logic proofs, apply an operator if this results in an expression which more
closely resembles the final expression than did the previous one. Another
heuristic is to set up sub-tasks. A graphic example of such problem
factorization is shown in the case of a safe with a lock having 10
independent dials numbered from 00 to 99. Random twirling of dials would
require, on the average, 1/2 x 10^20 trials to open the lock. If the lock is
defective and there is a faint click each time any dial is turned to its
correct setting, then on the average only 500 trials are required to open the
lock (50 trials for each dial). We might also note that insight into the
problem structure is actually the acquisition of an additional heuristic. It
may be of further interest to observe here that our pre-processing of data
makes use of many heuristics.
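The arithmetic behind the safe example is easy to reproduce. The sketch below follows the text in taking the average search as half the number of possibilities; the variable names are illustrative only.

```python
DIALS = 10
SETTINGS_PER_DIAL = 100          # positions 00 through 99

# Without clicks: all dials must be right at once, so the whole combination
# space is searched and, on average, about half of it is tried.
combinations = SETTINGS_PER_DIAL ** DIALS
blind_trials = combinations // 2

# With clicks: each dial becomes an independent sub-task of about 50 trials.
trials_per_dial = SETTINGS_PER_DIAL // 2
factored_trials = DIALS * trials_per_dial

print(f"blind search:     {blind_trials:.1e} trials")
print(f"factored search:  {factored_trials} trials")
print(f"reduction factor: {blind_trials // factored_trials:.1e}")
```

Factoring the problem turns a product of possibilities into a sum, which is where the saving of roughly seventeen orders of magnitude comes from.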

These ideas on thinking have been presented for their general
interest in the field. Some specific pattern recognition techniques follow.

Pandemonium. The model of Pandemonium originated by O. Selfridge consists
of four stages of devices. The first stage consists of data collection (and
display) devices, the second of computational devices, the third of cognitive
devices, and the fourth of a decision unit which selects the cognitive device
having the largest output. The cognitive devices are each associated with a
particular category (pattern) in the classification problem being solved.
These latter devices each measure (in some sense) the similarity between the
particular pattern the cognitive device represents and the as-yet-unclassified
input. Each cognitive device has an output proportional to the amount of
the aforementioned similarity. The unknown inputs are introduced to the
data collection devices (through which the physical world is represented),
and on which the computational devices operate. These latter extract the
various "features" of the pattern and give an output proportional to the
amount of the respective features. Between each computational device and
each cognitive device there is a weighting network, the value (weighting)
of which is determined during the process of supervised learning. The
weightings emphasize the features most significant for each pattern, and
the process of developing the weightings is known as "feature weighting".
The Pandemonium is programmed to adjust its weighting to minimize the output
of the appropriate cognitive device.
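The decision path through the four stages can be sketched as a weighted sum followed by a selection of the largest output. The example covers only the classification step, not the learning of the weightings; the feature values, weights, and category names are hypothetical.

```python
def cognitive_outputs(features, weights):
    """Output of each cognitive device: its weighted sum of feature outputs."""
    return {
        category: sum(w * f for w, f in zip(row, features))
        for category, row in weights.items()
    }

def decide(features, weights):
    """Decision unit: pick the cognitive device with the largest output."""
    outputs = cognitive_outputs(features, weights)
    return max(outputs, key=outputs.get)

# Hypothetical outputs of three computational devices (feature detectors).
features = [1.0, 0.0, 1.0]

# Hypothetical weighting networks, one row of weights per category.
weights = {
    "A": [0.9, 0.1, 0.8],
    "H": [0.2, 0.9, 0.1],
}

outputs = cognitive_outputs(features, weights)
decision = decide(features, weights)
print(outputs, decision)
```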

Perceptron. The Perceptron is a generic name for a family of pattern
recognition machines (originally proposed by Frank Rosenblatt) that operate on
principles not unlike those believed used in the human brain. The
perceptron consists of sensory units, associative units (each one is in effect
a variable memory unit in a large switchboard), and response units. There is
a response unit corresponding to each stimulus class (category), and each
gives an output proportional to the similarity between the pattern which it
represents and the unknown input. A decision device selects the largest
output. The outputs of the response units are fed back to weighting networks,
located between the associative units and the response units, in such a
manner that the connections contributing to the correct output are
strengthened and the connections contributing to incorrect outputs are
inhibited. As the stimuli are sequentially applied to the sensory units
(during the period of supervised learning), the weightings are readjusted
until the Perceptron approaches its asymptotic learning capability.

In elementary models of the Perceptron the connections are linear,
and the associative units are simple discrete representations of a neuron. In
more sophisticated models, the connections may be defined with non-linear
functions, and the associative units have more complex properties. For
example, in more advanced concepts of the Perceptron, the associative unit
itself adjusts its firing threshold and the value of its output in the course
of the learning process.
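The strengthening and inhibiting of connections described above can be sketched with a simple multi-category update rule for the elementary linear case. This is a generic textbook-style rule offered as an assumption, not a reconstruction of any particular Perceptron model; the training samples and all names are hypothetical.

```python
def predict(weights, x):
    """Decision device: index of the response unit with the largest output."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return scores.index(max(scores))

def train(samples, n_classes, n_inputs, epochs=10, rate=1.0):
    """Strengthen weights toward the correct unit and inhibit the wrong one."""
    weights = [[0.0] * n_inputs for _ in range(n_classes)]
    for _ in range(epochs):
        for x, label in samples:
            guess = predict(weights, x)
            if guess != label:
                for i, xi in enumerate(x):
                    weights[label][i] += rate * xi   # strengthen correct unit
                    weights[guess][i] -= rate * xi   # inhibit incorrect unit
    return weights

# Hypothetical associative-unit activity patterns with their categories.
samples = [([1, 0, 1], 0), ([1, 1, 0], 1), ([0, 1, 1], 2)]

weights = train(samples, n_classes=3, n_inputs=3)
predictions = [predict(weights, x) for x, _ in samples]
print(predictions)
```

On this small separable set the weights stop changing after a few passes, which is the toy analogue of approaching an asymptotic learning capability.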

Other Related Approaches. The Perceptron and Pandemonium are among the models
more widely known (by name) in this country. But there are other approaches
which merit as much examination as these two. Several people in England
have introduced their own models. Chapman proposes a model in which the
memory cell has certain special properties. Every time a cell fires under
stimulation it modifies the structure of the triggering cells differently
from the passive ones. Uttley has done work on conditional probability
computers and on methods of classification, following the study of the human
nervous system. Models follow from this. Taylor has done similar work.
In this country, Mattson has studied the classification problem with a model
in which he represents objects in an n-dimensional binary space. He then
divides the space into category regions with a number of hyperplanes. The
"learning" process consists in best locating the hyperplanes, that is, in
finding the coefficients defining each plane. The evaluation is made with
logical networks. Simple models have been constructed. Stanford University
is looking at the problem with a similar viewpoint.

Nerve Nets. McCulloch and Pitts have approached the problem by concentrating
their efforts on development of a model for a neuron. They then study the
properties of the different (nerve) nets which can be synthesized with these
neuron models. Whereas some people begin with a system, perhaps resembling
the biological model, and try to learn the requirements for its structure,
McCulloch and Pitts start with the basic element and study its applications
in large systems. They have directed their more recent efforts to further
refinement of the basic nerve model.

Character and Speech Recognition. Most studies of character and speech
recognition have not included real learning techniques, but have rather
depended upon looking for particular features in a pattern as suggested by
the ingenuity of the engineer. Mattson and Rosenblatt have applied their
techniques to character recognition, and other work is being done at Stanford
Research Institute, Lincoln Labs, Bell Telephone Labs, and at other
organizations. One procedure which appears to be widespread is the following:
the character is quantized in its two-dimensional display; noise is removed
by an operation of local averaging, i.e., of representing a box (quantum) by
the average of itself and all immediately adjacent boxes; the character line
width is standardized; certain character features are then extracted and the
recognition is based upon the existence of certain features. It is in the
feature selection that the designer's ingenuity is manifested. These features
may include different line orientations and straight line intersections, such
as a T, inverted T, a V, a slant, etc. In some techniques the original
character is quantized and then converted to binary form, and all subsequent
operations are logical (performed by a digital computer). In most of the
techniques location of character and size variations are special problems.
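The local-averaging step of this procedure can be sketched as follows. The example assumes that "immediately adjacent" means the eight surrounding boxes, with fewer neighbours at the edges of the display; the function name and the test pattern are hypothetical.

```python
def local_average(grid):
    """Replace each box by the mean of itself and its adjacent boxes."""
    rows, cols = len(grid), len(grid[0])
    smoothed = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            # Collect the box and its neighbours, clipped at the borders.
            cells = [
                grid[i][j]
                for i in range(max(0, r - 1), min(rows, r + 2))
                for j in range(max(0, c - 1), min(cols, c + 2))
            ]
            new_row.append(sum(cells) / len(cells))
        smoothed.append(new_row)
    return smoothed

# Hypothetical 3 x 3 display holding a single isolated noise spot.
noisy = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

smoothed = local_average(noisy)
for row in smoothed:
    print([round(v, 3) for v in row])
```

An isolated spot is spread thinly over its neighbourhood, so a later threshold (not shown) would remove it while leaving thicker strokes intact.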

The techniques of speech recognition are nicely described in C. Cherry's
book "On Human Communication", and in Bell Telephone Laboratories monographs.
Speech is represented in a two-dimensional time-frequency array which is
obtained by passing the speech through an array of staggered narrow-band
filters (in a device known as a VOCODER). Some particular speech recognition
techniques attempt to extract properties from these arrays, such properties
including formant frequencies, etc. Special problems arise from the wide
ranges in pitch and word duration among speakers, in addition to the many
other subtleties of the spoken word.

Language Translation

Strictly speaking, language translation is a problem in more specific
identification rather than in pattern recognition. However, many of the
translation processes do involve a search for patterns. For example, any
particular word sequence must be examined to see whether it has the pattern
of a grammatically correct sentence, i.e., is the relationship of verb, noun,
adjective, etc. consistent with the rules of syntax?

Turing Machines

One of the early classical approaches to examination of the
potentialities of machine learning was introduced by Turing. Claude Shannon
presents a description of a Universal Turing Machine in "Automata Studies"
(Princeton University Press, 1956), and his introductory description is here
given exactly.

"In a well-known paper, A.M. Turing defined a class of computing
machines now known as Turing machines. We may think of a Turing
machine as composed of three parts: a control element, a reading
and writing head, and an infinite tape. The tape is divided into a
sequence of squares each of which can carry any symbol from a finite
alphabet. The reading head will at a given time scan one square of
the tape. It can read the symbol written there and, under directions
from the control element, can write a new symbol and also move one
square to the right or left. The control element is a device with a
finite number of internal "states". At a given time, the next
operation of the machine is determined by the current state of the
control element and the symbol that is being read by the reading
head. This operation will consist of three parts; first the printing
of a new symbol in the present square (which may, of course, be the
same as the symbol just read); second, the passage of the control
element to a new state (which may also be the same as the previous
state); and third, movement of the reading head one square to the
right or left.

"In operation, some finite portion of the tape is prepared with a
starting sequence of symbols, the remainder of the tape being left
blank (i.e., registering a particular "blank" symbol). The reading
head is placed at a particular starting square and the machine
proceeds to compute in accordance with its rules of operation. In
Turing's original formulation, alternate squares were reserved for
the final answer, the others being used for intermediate calculations.
This and other details of the original definition have been varied
in later formulations of the theory.

"Turing showed that it is possible to design a universal machine
which will be able to act like any particular Turing machine when
supplied with a description of that machine. The description is
placed on the tape of the universal machine in accordance with a
certain code, as is also the starting sequence of the particular
machine. The universal machine then imitates the operation of the
particular machine".
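The three-part operation described in the quotation (print a symbol, change state, move one square) maps directly onto a small simulator. The sketch below is illustrative only: the halting state, the blank symbol, the step limit, and the example rule table are assumptions that do not appear in the quoted description.

```python
def run_turing_machine(rules, tape, state, head=0, blank="_",
                       halt="HALT", max_steps=1000):
    """Run a machine whose rules map (state, symbol) to
    (symbol to print, move 'L' or 'R', next state)."""
    cells = dict(enumerate(tape))            # sparse tape, blank elsewhere
    for _ in range(max_steps):
        if state == halt:
            break
        symbol = cells.get(head, blank)      # read the scanned square
        write, move, state = rules[(state, symbol)]
        cells[head] = write                  # first: print a symbol
        head += 1 if move == "R" else -1     # third: move one square
    if not cells:
        return "", state
    squares = range(min(cells), max(cells) + 1)
    return "".join(cells.get(i, blank) for i in squares).strip(blank), state

# Hypothetical machine: append one mark to a block of 1s (unary increment).
RULES = {
    ("scan", "1"): ("1", "R", "scan"),       # skip over existing marks
    ("scan", "_"): ("1", "R", "HALT"),       # write a mark on the first blank
}

final_tape, final_state = run_turing_machine(RULES, "111", "scan")
print(final_tape, final_state)
```

A universal machine would take a rule table such as `RULES` encoded on its own tape rather than as a separate argument; that encoding is beyond this sketch.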

There are many detailed variations in the approaches outlined above.
Another way of looking at the subject is contributed by Bellman and Kalaba,
and is built upon the dynamic programming techniques which these men have
developed. Further information on pattern recognition may be obtained from
the bibliography.

APPENDIX C

As experimental verification of the technique, the methods of
Sections 2 and 3 were applied to the machine-learned recognition of spoken
numerals. The sequence of labeled events is a large set of numerals, spoken
by different individuals, where each spoken word is labeled by one of the ten
numerals it represents. An unlabeled spoken word is recognized as a specific
numeral through its comparison to each of the ten categories by the functions
developed from the labeled examples. The ten categories of spoken numerals
"0" (zero) through "9" were represented by 400 different utterances made by
ten male speakers. The ten male speakers have regional accents drawn from
the north-east corner of the United States. No attempt was made to otherwise
control the selection of speakers or their rate of speech.

The model of the physical world considered adequate to represent
the speech events is obtained through use of an 18-channel Vocoder. The
Vocoder is a set of 18 stagger-tuned narrow band-pass filters which print
out the "instantaneous" frequency spectrum of the speech event as a function
of time. This is shown in Figure C-1, where frequency is plotted vertically,
time horizontally, and the intensity of the spectrum at a given frequency and
time is proportional to the grey level of the sonograph recording at the
corresponding time-frequency point. The numerical print-out of Figure C-1
is obtained by digitizing the above sonograph records into 18 frequency
channels each sampled at the rate of 20 msec/sample. Note that the samples
are orthogonal by construction because they represent waveforms that are
disjoint either in frequency or time. The resulting cell structure in the
time-frequency plane represents a one-second long speech event as a vector
in a 900-dimensional space. Each dimension corresponds to a possible cell
location and the coordinate value of a dimension is the intensity of the
corresponding cell. In the figure a three-digit binary number (eight
levels) represents the sonograph intensity, after the instantaneous total
speech intensity is normalized by the action of a fast AGC.

Figure C-1. The Spoken Word "Test"
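The dimension count follows from the sampling figures: a one-second event sampled every 20 msec gives 50 time cells, and 50 cells in each of 18 channels give 900 coordinates. The sketch below reproduces that count and the eight-level quantization; the synthetic intensity pattern and the function names are hypothetical, and the intensities are assumed to be already normalized to the range 0 to 1.

```python
CHANNELS = 18                 # Vocoder filter channels
SAMPLE_PERIOD_MS = 20         # one sample per channel every 20 msec
EVENT_DURATION_MS = 1000      # one-second speech event
LEVELS = 8                    # three binary digits per cell

FRAMES = EVENT_DURATION_MS // SAMPLE_PERIOD_MS
DIMENSION = CHANNELS * FRAMES

def quantize(intensity, levels=LEVELS):
    """Map a normalized intensity in [0, 1] to an integer level 0..levels-1."""
    clipped = min(max(intensity, 0.0), 1.0)
    return min(levels - 1, int(clipped * levels))

def to_vector(frames):
    """Flatten a frames-by-channels array into one vector of quantized cells."""
    return [quantize(value) for frame in frames for value in frame]

# Hypothetical normalized intensities standing in for a sonograph record.
spectrogram = [
    [((t + c) % 10) / 9.0 for c in range(CHANNELS)]
    for t in range(FRAMES)
]

vector = to_vector(spectrogram)
print(FRAMES, DIMENSION, len(vector), max(vector))
```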

A computer was programmed to implement the theory, and an increasing
number of spoken numerals were sequentially introduced from which the
computer constructed the optimum metrics. A different metric was developed
to measure similarity to each of the ten categories. Typical results of
the learning process are shown in Figure C-2. This figure contains four
confusion matrices constructed for the cases where numeral recognition was
learned from 3, 7, and 9 examples of each of the ten categories of digits.
The ordinate of a cell in the matrix signifies the digit which is spoken,
the abscissa denotes the decision of the machine, and the number in the cell
states the number of instances in which the stated decision was made. The
number 1 in row 6 and column 8 of Figure C-2c, for example, denotes the fact
that in one instance a spoken digit 6 was recognized as an 8. Note that the
error rate decreases as the number of known examples of categories is
increased. For the 9 examples per category no errors were made. This result
is particularly interesting in view of the fact that the spoken digits which
were tested were spoken by persons not included among those whose words were
used as examples.
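The layout of a confusion matrix as described above (row for the spoken digit, column for the machine's decision) can be sketched as follows. The list of trials is invented for illustration and does not reproduce the data of Figure C-2; the function names are hypothetical.

```python
def confusion_matrix(trials, categories=10):
    """Count decisions: row index = spoken digit, column index = decision."""
    matrix = [[0] * categories for _ in range(categories)]
    for spoken, decided in trials:
        matrix[spoken][decided] += 1
    return matrix

def error_rate(matrix):
    """Fraction of trials that fall off the principal diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total

# Hypothetical test run: every digit recognized once, plus one spoken 6
# that the machine took for an 8.
trials = [(digit, digit) for digit in range(10)] + [(6, 8)]

matrix = confusion_matrix(trials)
rate = error_rate(matrix)
for row in matrix:
    print(row)
print(f"error rate: {rate:.3f}")
```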